Re: [HACKERS] A Silly Idea for Vertically-Oriented Databases
Ühel kenal päeval, E, 2007-09-10 kell 11:01, kirjutas Alvaro Herrera: > Mark Mielke wrote: > > Simon Riggs wrote: > >> ISTM we would be able to do this fairly well if we implemented > >> Index-only columns. i.e. columns that don't exist in the heap, only in > >> an index. > >> Taken to the extreme, all columns could be removed from the heap and > >> placed in an index(es). Only the visibility information would remain on > >> the heap. > >> > > Wouldn't the extreme be - even the visibility information is in the index? > > Maybe we could extract the visibility information and put it in a > storage separate from the heap. A seqscan would be a bit slower (have > to check the heap and the visibility storage) but some index scans could > be faster (can check the index and visibility storage, skipping the > heap), and updates would be faster too (append into the heap, update > visibility storage). This would allow something closer to index-only > scans. Of course, the key is to allow efficient lookup of visibility > info given a TID. > > Has this been explored before? I wrote to this list about a vague plan for doing it some months ago. And I don't think that even seqscans need to be slower, as at least for typical Data Warehouse application the visibility info can be compressed a lot by even a simple RLE compression and thus you just do the sequscan like this while 1: get next visible range read in the visible range, without checking each tuple I suspect that this can be made much more cache-friendly than current scheme. (A Guess: I even go as far as to claim that separate visibility heap _may_ be more cahce friendly even uncompressed, so if we have every second tuple deleted, it will still be faster (at least for bigger tuples) to look up visibility in a dense visibility array and then access just the visible tuples. Here even separate heap may be not needed, just rearranging the heap structure to have visibility info in the beginning of page together with tuple pointers may be a win ) Another bonus of separate visibility heap is that it can be maintained separately, that is it can be shuffled around, CID's cleaned up, compressed, etc. much more effectively than current one, which, among other things will make VACUUM much more efficient, as neither all-visible or all-dead pages need to be visited at all. It can probably be made to serve as both DSM and VACUUM map also. And it can be used for finding "best" insert spot for CLUSTERing-preserving inserts/updates. On subject of vertical or column-per-heap storage I see the biggest obstacle to be te use of TID-s as tuple identifiers, as the "real" TIDs (pageno32:tupleno:16) will be different for each column. So if we have a table with a 1 kb text column, we will also have to store just 8 ints per page in the int column if we want to preserve unique TID->tuple mapping. What we could do of course is start defining TID as just a 48-bit tuple number and then have some rules for finding the PAGE:NR "tid" from that it would be easy for fixed-size columns (PAGE:NR) = (TID/items_per_page:TID mod items_per_page) but more complicated for VARLEN fields. All the ideas I've had sofar for solving this have quite a lot of downsides. -- Hannu ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] A Silly Idea for Vertically-Oriented Databases
Avery Payne wrote: > >I thought maybe we can call it COAST, Column-oriented attribute storage > technique, :-) > > I like it. :-)http://www.2ndQuadrant.com";> I > just wish I would have read this before applying for a project name > at pgfoundry, the current proposal is given as "pg-cstore". You can email the pgfoundry admins at [EMAIL PROTECTED] and ask it to be cancelled instead of approved, and resubmit with the other name. -- Alvaro Herrera http://www.PlanetPostgreSQL.org/ "In fact, the basic problem with Perl 5's subroutines is that they're not crufty enough, so the cruft leaks out into user-defined code instead, by the Conservation of Cruft Principle." (Larry Wall, Apocalypse 6) ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] A Silly Idea for Vertically-Oriented Databases
>ISTM we would be able to do this fairly well if we implemented >Index-only columns. i.e. columns that don't exist in the heap, only in >an index. >Taken to the extreme, all columns could be removed from the heap and >placed in an index(es). Only the visibility information would remain on >the heap. So, let me understand this correctly - you want to index the columns and use the index to reconstruct the data? Some kind of "implicit" reconstruction? >Doing this per column would be a big win over vertical databases >since AFAIK they *have* to do this to *every* column, even if it is not >beneficial to do so. I was thinking about something a little more crude - each column being a free-standing table, but being "viewed" by the client as a single entity, a kind of "data federation". The idea was that the existing storage mechanism wouldn't be altered, and with a little slight-of-hand, we could extend the mechanism without hampering things like clustering, future extensions, optimizations, etc. No changes to MVCC would be needed, because it would above and through it. The idea was that for a "wide" table you target only a few columns, so you could sequential read without the penalty of having to seek to a new offset for each record. Instead of processing all the columns, you process a single column, which means less data to read for the same number of records. That gain starts to slope off when you specify more and more columns from the table. Throw in selective indexing (similar to what you were talking about) and suddenly we can reclaim some of that lost speed. We'll make a kind of "compressed index", where the key turns into a hash that points to (is attached to?) a bucket, and the bucket contains the offset of all the records that relate to that key. Other tricks can be employed, such as run-length encoding entire ranges of offsets, etc. to keep this really really small. Really small = faster and faster to read, using more CPU than I/O. And I/O is still more of an issue than CPU at this point. Then again, if you ditch the column altogether and use a "compressed index" to reconstruct data implicitly, now we're close to what you were talking about (assuming I understand you correctly and also assuming that PostgreSQL doesn't already do this with indexes). So, if the column is indexed, then maybe split it off into a "compressed index", and if not, keep it in the main table outright? I guess I really need to think about this a bit more before I start delving into code. >I thought maybe we can call it COAST, Column-oriented attribute storage technique, :-) I like it. :-) I just wish I would have read this before applying for a project name at pgfoundry, the current proposal is given as "pg-cstore".
Re: [HACKERS] A Silly Idea for Vertically-Oriented Databases
Mark Mielke wrote: > Simon Riggs wrote: >> ISTM we would be able to do this fairly well if we implemented >> Index-only columns. i.e. columns that don't exist in the heap, only in >> an index. >> Taken to the extreme, all columns could be removed from the heap and >> placed in an index(es). Only the visibility information would remain on >> the heap. >> > Wouldn't the extreme be - even the visibility information is in the index? Maybe we could extract the visibility information and put it in a storage separate from the heap. A seqscan would be a bit slower (have to check the heap and the visibility storage) but some index scans could be faster (can check the index and visibility storage, skipping the heap), and updates would be faster too (append into the heap, update visibility storage). This would allow something closer to index-only scans. Of course, the key is to allow efficient lookup of visibility info given a TID. Has this been explored before? -- Alvaro Herrerahttp://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] A Silly Idea for Vertically-Oriented Databases
Simon Riggs wrote: ISTM we would be able to do this fairly well if we implemented Index-only columns. i.e. columns that don't exist in the heap, only in an index. Taken to the extreme, all columns could be removed from the heap and placed in an index(es). Only the visibility information would remain on the heap. Wouldn't the extreme be - even the visibility information is in the index? :-) Cheers, mark -- Mark Mielke <[EMAIL PROTECTED]> ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] A Silly Idea for Vertically-Oriented Databases
On Fri, 2007-09-07 at 13:52 -0700, Avery Payne wrote: > So I've been seeing/hearing all of the hoopla over vertical databases > (column stores), and how they'll not only slice bread but also make > toast, etc. I've done some quick searches for past articles on > "C-Store", "Vertica", "Column Store", and "Vertical Database", and have > seen little discussion on this. I looked at doing this a while back, for similar reasons. ISTM we would be able to do this fairly well if we implemented Index-only columns. i.e. columns that don't exist in the heap, only in an index. Taken to the extreme, all columns could be removed from the heap and placed in an index(es). Only the visibility information would remain on the heap. Syntax for this would be an ALTER TABLE SET STORAGE command, with a new type of storage definition that will only be accepted if an index already has been defined on the table which includes the specified column. Doing this per column would be a big win over vertical databases since AFAIK they *have* to do this to *every* column, even if it is not beneficial to do so. Every existing index plan works immediately. The main annoyance is retrieving a column value that doesn't exist on the heap. That would require a new kind of plan that involves preparing the index(es) by sorting them on htid and then doing a left merge join with the main heap. By now, most people will be screaming at their monitors "what an idiot, thats gonna REALLY suck". True, but this is the same thing that column-oriented databases have to do also, so it would be taking the best that vertical databases have to offer and accepting the consequences. There are some other plan possibilities also, apart from this basic value retrieval, but they would require further index API changes; I'm not certain of those as being primary use cases, however. Vertical databases honestly do have their uses and there are many kinds of marketing query that have complex where clauses yet only simple select clauses. There are a number of industry-specific database products that have utilised this technique to good effect for a number of years. So ISTM the main changes would be executor changes to allow retrieving column values of index-only columns when they are required, and to modify the insert/update tuple code somewhat. I thought maybe we can call it COAST, Column-oriented attribute storage technique, :-) -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] A Silly Idea for Vertically-Oriented Databases
On 9/7/07, Avery Payne <[EMAIL PROTECTED]> wrote: > > Avery, > > > > >If someone writes the rest of the code, I doubt the syntax will be the > >holdup. But writing an efficient C-store table mechanism is much harder > >than I think you think it is; Vertica worked on it for a year and failed, > >and Paraccel took two years to succeed. FYI, Paraccel is based on > >Postgres. > > >So, put up a pgfoundry project and start hacking a c-store table; I'm sure > >you;ll get interest if you can make something work. > > >-- > >--Josh > > Well, I did say it was a *crazy* idea. :-) > > Given that I would be starting from the ground-floor, learning not only > the innards of PostgreSQL but also C coding as well, I would probably > have to overcome near-insurmountable odds to make this project take off. > Still, > if I was crazy enough to think it, maybe I'll be crazy enough to > try for it. ;-) > > Just ignore my 2nd posting, I was trying to clarify some of the > ramblings I was typing. For whatever it's worth, I was reading about the same things today and came up with the same basic idea, without the same level of implementation details you've talked about, Avery. And it sounds really neat. Hard, but neat, and potentially worth it if, say, Paraccel doesn't open source their stuff first :) But I don't know the PostgreSQL internals either, though I've been known to throw together some C code now and again. So in short, there's interest. Whether there's collective skill/free time/motivation/insanity/etc. enough to make something useful of the interest is another question altogether. :) - Josh/eggyknap ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] A Silly Idea for Vertically-Oriented Databases
Avery, >If someone writes the rest of the code, I doubt the syntax will be the >holdup. But writing an efficient C-store table mechanism is much harder >than I think you think it is; Vertica worked on it for a year and failed, >and Paraccel took two years to succeed. FYI, Paraccel is based on >Postgres. >So, put up a pgfoundry project and start hacking a c-store table; I'm sure >you;ll get interest if you can make something work. >-- >--Josh Well, I did say it was a *crazy* idea. :-) Given that I would be starting from the ground-floor, learning not only the innards of PostgreSQL but also C coding as well, I would probably have to overcome near-insurmountable odds to make this project take off. Still, if I was crazy enough to think it, maybe I'll be crazy enough to try for it. ;-) Just ignore my 2nd posting, I was trying to clarify some of the ramblings I was typing.
Re: [HACKERS] A Silly Idea for Vertically-Oriented Databases
Avery, > Make one small, very tiny syntactic change to "CREATE TABLE" that > includes a new keyword, "COLUMN-STORE" or something similar. If someone writes the rest of the code, I doubt the syntax will be the holdup. But writing an efficient C-store table mechanism is much harder than I think you think it is; Vertica worked on it for a year and failed, and Paraccel took two years to succeed. FYI, Paraccel is based on Postgres. So, put up a pgfoundry project and start hacking a c-store table; I'm sure you;ll get interest if you can make something work. -- --Josh Josh Berkus PostgreSQL @ Sun San Francisco ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly