Re: [HACKERS] A Silly Idea for Vertically-Oriented Databases

2007-09-12 Thread Hannu Krosing
Ühel kenal päeval, E, 2007-09-10 kell 11:01, kirjutas Alvaro Herrera:
> Mark Mielke wrote:
> > Simon Riggs wrote:
> >> ISTM we would be able to do this fairly well if we implemented
> >> Index-only columns. i.e. columns that don't exist in the heap, only in
> >> an index. 
> >> Taken to the extreme, all columns could be removed from the heap and
> >> placed in an index(es). Only the visibility information would remain on
> >> the heap.
> >>   
> > Wouldn't the extreme be - even the visibility information is in the index? 
> 
> Maybe we could extract the visibility information and put it in a
> storage separate from the heap.  A seqscan would be a bit slower (have
> to check the heap and the visibility storage) but some index scans could
> be faster (can check the index and visibility storage, skipping the
> heap), and updates would be faster too (append into the heap, update
> visibility storage).  This would allow something closer to index-only
> scans.  Of course, the key is to allow efficient lookup of visibility
> info given a TID.
> 
> Has this been explored before?

I wrote to this list about a vague plan for doing it some months ago.

And I don't think that even seqscans need to be slower, as at least for
typical Data Warehouse application the visibility info can be compressed
a lot by even a simple RLE compression and thus you just do the sequscan
like this

while 1:
get next visible range
read in the visible range, without checking each tuple

I suspect that this can be made much more cache-friendly than current
scheme.

(A Guess: I even go as far as to claim that separate visibility heap
_may_ be more cahce friendly even uncompressed, so if we have every
second tuple deleted, it will still be faster (at least for bigger
tuples) to look up visibility in a dense visibility array and then
access just the visible tuples. Here even separate heap may be not
needed, just rearranging the heap structure to have visibility info in
the beginning of page together with tuple pointers may be a win )

Another bonus of separate visibility heap is that it can be maintained
separately, that is it can be shuffled around, CID's cleaned up,
compressed, etc. much more effectively than current one, which, among
other things will make VACUUM much more efficient, as neither
all-visible or all-dead pages need to be visited at all.

It can probably be made to serve as both DSM and VACUUM map also. And it
can be used for finding "best" insert spot for CLUSTERing-preserving
inserts/updates.


On subject of vertical or column-per-heap storage I see the biggest
obstacle to be te use of TID-s as tuple identifiers, as the "real" TIDs
(pageno32:tupleno:16) will be different for each column. So if we have a
table with a 1 kb text column, we will also have to store just 8 ints
per page in the int column if we want to preserve unique TID->tuple
mapping.

What we could do of course is start defining TID as just a 48-bit tuple
number and then have some rules for finding the PAGE:NR "tid" from that

it would be easy for fixed-size columns 
(PAGE:NR) = (TID/items_per_page:TID mod items_per_page)
but more complicated for VARLEN fields.

All the ideas I've had sofar for solving this have quite  a lot of
downsides. 

--
Hannu








---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] A Silly Idea for Vertically-Oriented Databases

2007-09-10 Thread Alvaro Herrera
Avery Payne wrote:

> >I thought maybe we can call it COAST, Column-oriented attribute storage 
> technique, :-)
> 
> I like it. :-)http://www.2ndQuadrant.com";>  I 
> just wish I would have read this before applying for a project name
> at pgfoundry, the current proposal is given as "pg-cstore".

You can email the pgfoundry admins at [EMAIL PROTECTED] and
ask it to be cancelled instead of approved, and resubmit with the other
name.

-- 
Alvaro Herrera   http://www.PlanetPostgreSQL.org/
"In fact, the basic problem with Perl 5's subroutines is that they're not
crufty enough, so the cruft leaks out into user-defined code instead, by
the Conservation of Cruft Principle."  (Larry Wall, Apocalypse 6)

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] A Silly Idea for Vertically-Oriented Databases

2007-09-10 Thread Avery Payne






>ISTM we would be able to do this fairly well if we implemented
>Index-only columns. i.e. columns that don't exist in the heap, only in
>an index. 

>Taken to the extreme, all columns could be removed from the heap and
>placed in an index(es). Only the visibility information would remain on
>the heap.

So, let me understand this correctly - you want to index the columns and
use the index to reconstruct the data?  Some kind of "implicit" reconstruction?

>Doing this per column would be a big win over vertical databases
>since AFAIK they *have* to do this to *every* column, even if it is not
>beneficial to do so.



I was thinking about something a little more crude - each column being a free-standing
table, but being "viewed" by the client as a single entity, a kind of "data federation".
The idea was that the existing storage mechanism wouldn't be altered, and
with a little slight-of-hand, we could extend the mechanism without hampering things
like clustering, future extensions, optimizations, etc.  No changes to MVCC would be
needed, because it would above and through it.

The idea was that for a "wide" table you target only a few columns, so you could
sequential read without the penalty of having to seek to a new offset for each record.
Instead of processing all the columns, you process a single column, which means less
data to read for the same number of records.  That gain starts to slope off
when you specify more and more columns from the table.

Throw in selective indexing (similar to what you were talking about) and suddenly we can reclaim
some of that lost speed.  We'll make a kind of "compressed index", where the key turns into
a hash that points to (is attached to?) a bucket, and the bucket contains the offset of all the
records that relate to that key.  Other tricks can be employed, such as run-length encoding
entire ranges of offsets, etc. to keep this really really small.  Really small = faster and faster
to read, using more CPU than I/O.  And I/O is still more of an issue than CPU at this point.

Then again, if you ditch the column altogether and use a "compressed index" to reconstruct
data implicitly, now we're close to what you were talking about (assuming I understand you
correctly and also assuming that PostgreSQL doesn't already do this with indexes).  So, if the
column is indexed, then maybe split it off into a "compressed index", and if not, keep it in the main
table outright?

I guess I really need to think about this a bit more before I start delving into code.



>I thought maybe we can call it COAST, Column-oriented attribute storage technique, :-)

I like it. :-)  I just wish I would have read this before applying for a project name
at pgfoundry, the current proposal is given as "pg-cstore".







Re: [HACKERS] A Silly Idea for Vertically-Oriented Databases

2007-09-10 Thread Alvaro Herrera
Mark Mielke wrote:
> Simon Riggs wrote:
>> ISTM we would be able to do this fairly well if we implemented
>> Index-only columns. i.e. columns that don't exist in the heap, only in
>> an index. 
>> Taken to the extreme, all columns could be removed from the heap and
>> placed in an index(es). Only the visibility information would remain on
>> the heap.
>>   
> Wouldn't the extreme be - even the visibility information is in the index? 

Maybe we could extract the visibility information and put it in a
storage separate from the heap.  A seqscan would be a bit slower (have
to check the heap and the visibility storage) but some index scans could
be faster (can check the index and visibility storage, skipping the
heap), and updates would be faster too (append into the heap, update
visibility storage).  This would allow something closer to index-only
scans.  Of course, the key is to allow efficient lookup of visibility
info given a TID.

Has this been explored before?

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] A Silly Idea for Vertically-Oriented Databases

2007-09-10 Thread Mark Mielke

Simon Riggs wrote:

ISTM we would be able to do this fairly well if we implemented
Index-only columns. i.e. columns that don't exist in the heap, only in
an index. 


Taken to the extreme, all columns could be removed from the heap and
placed in an index(es). Only the visibility information would remain on
the heap.
  
Wouldn't the extreme be - even the visibility information is in the 
index? :-)


Cheers,
mark

--
Mark Mielke <[EMAIL PROTECTED]>

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] A Silly Idea for Vertically-Oriented Databases

2007-09-10 Thread Simon Riggs
On Fri, 2007-09-07 at 13:52 -0700, Avery Payne wrote:

> So I've been seeing/hearing all of the hoopla over vertical databases 
> (column stores), and how they'll not only slice bread but also make 
> toast, etc.  I've done some quick searches for past articles on 
> "C-Store", "Vertica", "Column Store", and "Vertical Database", and have 
> seen little discussion on this. 

I looked at doing this a while back, for similar reasons.

ISTM we would be able to do this fairly well if we implemented
Index-only columns. i.e. columns that don't exist in the heap, only in
an index. 

Taken to the extreme, all columns could be removed from the heap and
placed in an index(es). Only the visibility information would remain on
the heap.

Syntax for this would be an ALTER TABLE SET STORAGE command, with a new
type of storage definition that will only be accepted if an index
already has been defined on the table which includes the specified
column. Doing this per column would be a big win over vertical databases
since AFAIK they *have* to do this to *every* column, even if it is not
beneficial to do so.

Every existing index plan works immediately. The main annoyance is
retrieving a column value that doesn't exist on the heap. That would
require a new kind of plan that involves preparing the index(es) by
sorting them on htid and then doing a left merge join with the main
heap. By now, most people will be screaming at their monitors "what an
idiot, thats gonna REALLY suck". True, but this is the same thing that
column-oriented databases have to do also, so it would be taking the
best that vertical databases have to offer and accepting the
consequences. There are some other plan possibilities also, apart from
this basic value retrieval, but they would require further index API
changes; I'm not certain of those as being primary use cases, however.

Vertical databases honestly do have their uses and there are many kinds
of marketing query that have complex where clauses yet only simple
select clauses. There are a number of industry-specific database
products that have utilised this technique to good effect for a number
of years.

So ISTM the main changes would be executor changes to allow retrieving
column values of index-only columns when they are required, and to
modify the insert/update tuple code somewhat. I thought maybe we can
call it COAST, Column-oriented attribute storage technique, :-)

-- 
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] A Silly Idea for Vertically-Oriented Databases

2007-09-07 Thread Josh Tolley
On 9/7/07, Avery Payne <[EMAIL PROTECTED]> wrote:
>
>  Avery,
>
> 
>
> >If someone writes the rest of the code, I doubt the syntax will be the
> >holdup. But writing an efficient C-store table mechanism is much harder
> >than I think you think it is; Vertica worked on it for a year and failed,
> >and Paraccel took two years to succeed. FYI, Paraccel is based on
> >Postgres.
>
> >So, put up a pgfoundry project and start hacking a c-store table; I'm sure
> >you;ll get interest if you can make something work.
>
> >--
> >--Josh
>
> Well, I did say it was a *crazy* idea. :-)
>
> Given that I would be starting from the ground-floor, learning not only
> the innards of PostgreSQL but also C coding as well, I would probably
> have to overcome near-insurmountable odds to make this project take off.
> Still,
> if I was crazy enough to think it, maybe I'll be crazy enough to
> try for it. ;-)
>
> Just ignore my 2nd posting, I was trying to clarify some of the
> ramblings I was typing.

For whatever it's worth, I was reading about the same things today and
came up with the same basic idea, without the same level of
implementation details you've talked about, Avery. And it sounds
really neat. Hard, but neat, and potentially worth it if, say,
Paraccel doesn't open source their stuff first :)

But I don't know the PostgreSQL internals either, though I've been
known to throw together some C code now and again. So in short,
there's interest. Whether there's collective skill/free
time/motivation/insanity/etc. enough to make something useful of the
interest is another question altogether. :)

- Josh/eggyknap

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] A Silly Idea for Vertically-Oriented Databases

2007-09-07 Thread Avery Payne




Avery,



>If someone writes the rest of the code, I doubt the syntax will be the 
>holdup.  But writing an efficient C-store table mechanism is much harder 
>than I think you think it is; Vertica worked on it for a year and failed, 
>and Paraccel took two years to succeed.   FYI, Paraccel is based on 
>Postgres.

>So, put up a pgfoundry project and start hacking a c-store table; I'm sure 
>you;ll get interest if you can make something work.

>-- 
>--Josh

Well, I did say it was a *crazy* idea. :-)

Given that I would be starting from the ground-floor, learning not only
the innards of PostgreSQL but also C coding as well, I would probably
have to overcome near-insurmountable odds to make this project take off.  Still,
if I was crazy enough to think it, maybe I'll be crazy enough to 
try for it. ;-)

Just ignore my 2nd posting, I was trying to clarify some of the 
ramblings I was typing.





Re: [HACKERS] A Silly Idea for Vertically-Oriented Databases

2007-09-07 Thread Josh Berkus
Avery,

> Make one small, very tiny syntactic change to "CREATE TABLE" that
> includes a new keyword, "COLUMN-STORE" or something similar. 

If someone writes the rest of the code, I doubt the syntax will be the 
holdup.  But writing an efficient C-store table mechanism is much harder 
than I think you think it is; Vertica worked on it for a year and failed, 
and Paraccel took two years to succeed.   FYI, Paraccel is based on 
Postgres.

So, put up a pgfoundry project and start hacking a c-store table; I'm sure 
you;ll get interest if you can make something work.

-- 
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly