Re: [HACKERS] Time-correlated columns in large tables
Jeroen,

On 3/5/07 12:39 PM, "Jeroen T. Vermeulen" <[EMAIL PROTECTED]> wrote:
> I guess if you did simple run-length compression on these bitmaps you'd
> end up more or less where I came in. But you wouldn't want to flip a bit
> somewhere in the middle of a compressed data stream, of course. :-)

We handle that by recompressing within the page if possible, and by splitting the page if not. Jie/Gavin's work will initially be an equality-encoded bitmap index, as Heikki indicates; soon after, we can implement range encoding, etc.

- Luke

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Time-correlated columns in large tables
On Tue, March 6, 2007 03:17, Heikki Linnakangas wrote:
> I think you've just described a range-encoded bitmap index. The idea is
> to divide the range of valid values into some smallish number of
> subranges, and for each of these boundary values you store a bitmap
> where you set the bit representing every tuple with value <= boundary
> value.

That's pretty cool! From the looks of it, what you describe would solve my problem--but using more storage in return for more flexibility. My scheme really required a correlation between a value and storage order, which can be pretty fragile. These range-encoded bitmap indexes wouldn't have that problem.

I guess if you did simple run-length compression on these bitmaps you'd end up more or less where I came in. But you wouldn't want to flip a bit somewhere in the middle of a compressed data stream, of course. :-)

Jeroen
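The flip-a-bit problem Jeroen mentions can be made concrete with a minimal sketch (illustrative Python, not PostgreSQL code, and all names are hypothetical): run-length encoding a bitmap, then flipping one bit in the middle. The flip splits one run into several, so the compressed stream changes length and cannot simply be patched in place.

```python
def rle_encode(bits):
    """Encode a bit list as (bit_value, run_length) pairs."""
    runs = []
    for b in bits:
        if runs and runs[-1][0] == b:
            runs[-1] = (b, runs[-1][1] + 1)
        else:
            runs.append((b, 1))
    return runs

bitmap = [1, 1, 1, 0, 0, 0, 0, 1]
print(rle_encode(bitmap))   # 3 runs: [(1, 3), (0, 4), (1, 1)]

bitmap[5] = 1               # flip one bit in the middle of a run...
print(rle_encode(bitmap))   # ...now 5 runs: [(1, 3), (0, 2), (1, 1), (0, 1), (1, 1)]
```

The encoded stream grows from 3 runs to 5, which is why an updatable compressed bitmap needs something like the in-page recompression (or page split) Luke describes elsewhere in the thread.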
Re: [HACKERS] Time-correlated columns in large tables
Jeroen T. Vermeulen wrote:

[Q: Is there some other transparent optimization for values that correlate with insertion/update order?]

> So I was wondering whether it would make sense to have a more compact
> kind of index. One that partitions the value range of a given column
> into sub-ranges, and for each of those, merely keeps loose track of
> where those value ranges might be found. "Dates from 2006-07-15 to
> 2006-08-04: try blocks 99-126 and 175." Just a fixed maximum of two or
> three contiguous block ranges per value range would probably do the
> trick. The risk of having too few, of course, is that one oddly
> positioned block could make the index look as if a particular value
> range was spread out throughout most of the table.

I think you've just described a range-encoded bitmap index. The idea is to divide the range of valid values into some smallish number of subranges, and for each of these boundary values you store a bitmap where you set the bit representing every tuple with value <= boundary value.

For example, imagine a simple table like this:

 position | key
----------+-----
        1 |   1
        2 |   2
        3 |   3
      ...
       10 |  10

If you divide this into, for example, 4 subranges, you'd get bitmaps:

 key   | bitmap
-------+------------
 <= 3  | 1110000000
 <= 6  | 1111110000
 <= 8  | 1111111100
 <= 10 | 1111111111

Now to find all tuples <= 3, you'd fetch all tuples in the first bitmap. To find all tuples <= 2, you'd fetch all tuples in the first bitmap, and recheck the <= 2 condition. To find anything between, say, 4 and 8, you'd take the XOR of the 3rd and 1st bitmaps, and recheck the >= 4 condition on the resulting tuples.

So basically this allows you to satisfy any range query by reading two bitmaps and XORing them together. The resulting tuples will generally need to be rechecked because the index is lossy.

I *really* hope the equality-encoded bitmap index gets into 8.3. I've been trying to keep range-encoded bitmaps in mind when looking at the proposed indexam API changes. Hopefully we'll see them in 8.4. Feel free to submit a patch ;-).
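The two-bitmap XOR scheme above can be sketched in a few lines of Python (a toy illustration with assumed names, not PostgreSQL code): build one bitmap per boundary, then answer a range query by XORing the bitmap just above the upper bound with the bitmap just below the lower bound, followed by a recheck of the exact condition.

```python
def build_range_bitmaps(keys, boundaries):
    """For each boundary b, a bitmap with bit i set iff keys[i] <= b."""
    return {b: [1 if k <= b else 0 for k in keys] for b in boundaries}

def range_query(keys, bitmaps, boundaries, lo, hi):
    """Positions of tuples with lo <= key <= hi, via two bitmaps plus recheck."""
    # Smallest boundary >= hi: covers everything up to hi (and possibly more).
    upper_bm = bitmaps[min(b for b in boundaries if b >= hi)]
    # Largest boundary < lo: covers everything strictly below lo, if any.
    lowers = [b for b in boundaries if b < lo]
    lower_bm = bitmaps[max(lowers)] if lowers else [0] * len(keys)
    # XOR leaves exactly the tuples between the two boundaries.
    candidates = [i for i, (u, l) in enumerate(zip(upper_bm, lower_bm)) if u ^ l]
    # The index is lossy: recheck the exact condition on candidate tuples.
    return [i for i in candidates if lo <= keys[i] <= hi]

keys = list(range(1, 11))          # position i holds key i+1, as in the example
boundaries = [3, 6, 8, 10]
bitmaps = build_range_bitmaps(keys, boundaries)
print(range_query(keys, bitmaps, boundaries, 4, 8))   # positions 3..7 (keys 4..8)
```

For the 4..8 query this XORs the <= 8 and <= 3 bitmaps, matching the example in the text; for a query like <= 2 the recheck step discards the false positive with key 3.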
--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
[HACKERS] Time-correlated columns in large tables
I'm a bit embarrassed to bring this up here because I don't know much about storage layout and indexing. It's probably a silly notion, but if so, could someone please tell me how and why? First I'll describe the situation that leads me to write this.

I'm seeing some performance problems in an application that keeps a large table of logged events. Rows aren't normally ever updated, but new ones are inserted all the time. There are several fields containing various kinds of timestamps, some exact and some sloppy. These and perhaps other columns are loosely correlated with the order in which the rows are inserted.

[Q: Does this happen to a lot of users? Is it worth bothering with?]

Naturally the table is frequently queried for ranges of timestamps of one kind or another. It's probably not atypical. Some of the timestamp columns are indexed, but I'm concerned about both the size of the indexes and the cost they add to inserts. Both queries using the indexes and insertions can become unacceptably slow at times.

[Q: Are the access and insertion costs of an index really non-negligible compared to those of the table itself?]

It sounded to me like this might possibly be a job for spatial indexes (degraded down to a single dimension), but I couldn't find enough documentation to figure out whether they generalize to this usage. From what I found, it didn't sound likely.

[Q: Do spatial indexes work on simple scalar values and operators? If not, any reason why not? Otherwise, would they help?]

Bruce suggested partitioning the table, but that won't work well for multiple independent criteria, and it comes with a host of scalability and management issues. I'd also want something that was transparent to the application. Perhaps I'm asking for too much, but AFAIK it's exactly that ability to separate functionality from optimization that made SQL a success in the first place.

[Q: Is there some other transparent optimization for values that correlate with insertion/update order?]
So I was wondering whether it would make sense to have a more compact kind of index. One that partitions the value range of a given column into sub-ranges, and for each of those, merely keeps loose track of where those value ranges might be found. "Dates from 2006-07-15 to 2006-08-04: try blocks 99-126 and 175." Just a fixed maximum of two or three contiguous block ranges per value range would probably do the trick. The risk of having too few, of course, is that one oddly positioned block could make the index look as if a particular value range was spread out throughout most of the table.

[Q: Am I re-inventing the wheel? If not, is there really a robust, linear correlation between a row's time of insertion or last update and its storage block number?]

The index would not need to be exact, just a bit pessimistic. Any re-partitioning or re-balancing could be done offline, in "vacuum analyze" style. AFAICS that would allow O(1) insertion cost at runtime, or more aggressive tradeoffs to reduce run-time degradation of indexing quality. Most maintenance could be done incrementally to smooth out its costs over time.

With an "index" like this, an index scan would be very cheap indeed, but you'd then also scan a small subset of the table itself for exact matches. I would hope that would be a win if the indexed value accounts for more than an infinitesimal portion of table size, and is reasonably correlated with insertion/update time. Also, I guess bitmap scans on the indexed values could in principle use compressed bitmaps, excluding areas that according to the index contain no matching rows. There might be some marginal benefit there.

[Q: Would this fit in at all? Any particular reason why it doesn't make sense?]

Ready and grateful for any criticism,

Jeroen
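The "loose track of block ranges" idea in this mail can be sketched as follows (an illustrative Python toy with invented names, not a proposed implementation): each value sub-range maps to at most a fixed number of contiguous block extents that may contain matching rows; an insert either extends an adjacent extent or, when the budget is exhausted, pessimistically widens the nearest one.

```python
MAX_EXTENTS = 3  # fixed maximum of contiguous block ranges per value sub-range

class LooseIndex:
    def __init__(self):
        self.extents = {}  # sub_range_key -> list of (first_block, last_block)

    def note_insert(self, sub_range_key, block):
        """O(1) bookkeeping per insert: no exact per-row entries are kept."""
        exts = self.extents.setdefault(sub_range_key, [])
        for n, (lo, hi) in enumerate(exts):
            if lo - 1 <= block <= hi + 1:      # extend an adjacent extent
                exts[n] = (min(lo, block), max(hi, block))
                return
        if len(exts) < MAX_EXTENTS:
            exts.append((block, block))
        else:
            # Budget exhausted: pessimistically widen the nearest extent.
            n, (lo, hi) = min(
                enumerate(exts),
                key=lambda e: min(abs(block - e[1][0]), abs(block - e[1][1])))
            exts[n] = (min(lo, block), max(hi, block))

    def blocks_to_scan(self, sub_range_key):
        """Blocks that *may* hold matches; the caller rechecks exact values."""
        return self.extents.get(sub_range_key, [])
```

When inserts are time-correlated the extents stay tight, and a stray block merely widens one extent rather than per-row index entries; the "one oddly positioned block" risk from the mail shows up as exactly that widening.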