Re: [HACKERS] Fast insertion indexes: why no developments
Simon Riggs wrote
>> Can you CLUSTER against a minmax index?
>
> Not in this release, at least in my understanding. It's not yet possible
> to do an ordered fetch, so the cluster scan probably won't work.

As per the patch I helped write, CLUSTER should use the sequential heap
scan + sort when "it makes sense". So I think that if the index is not able
to do an ordered fetch, CLUSTER should fall back to scan + sort
automatically (which is what you want on a large table anyway). Obviously,
that should be tested.
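To make the fallback concrete (table and index names are hypothetical, and
the commented command assumes the patch's "minmax" access method, which is
not clusterable in the version under discussion):

-- today's behaviour with a btree: since 9.1, CLUSTER costs both
-- strategies and may pick a sequential heap scan + sort over an
-- ordered index scan
CREATE TABLE cdr (call_timestamp timestamptz, caller_id bigint);
CREATE INDEX cdr_ts_btree ON cdr (call_timestamp);
CLUSTER cdr USING cdr_ts_btree;

-- the suggestion above: a minmax index cannot drive an ordered fetch,
-- so CLUSTER against one would have to take the scan + sort path
-- unconditionally
-- CLUSTER cdr USING cdr_ts_minmax;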
Re: [HACKERS] Fast insertion indexes: why no developments
On 13 November 2013 11:54, Merlin Moncure wrote:
>> Load time
>>   MinMax  no overhead, same as raw COPY
>>   BTree   considerably slower

And just as a general comment, the min max index does not slow down COPY
as the table gets larger, whereas the btree gets slower as the table gets
larger. Which is the reason Leonardo requires partitioned tables.

--
Simon Riggs  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
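A minimal way to reproduce that load-time comparison (all names are
hypothetical, and "minmax" as the access-method name is an assumption
based on the patch under discussion):

-- baseline: raw COPY into an unindexed table
CREATE TABLE cdr_plain (caller_id bigint, call_timestamp timestamptz);
\copy cdr_plain from 'cdr.csv' csv

-- with a btree: insertion cost keeps growing with table size
CREATE TABLE cdr_btree (LIKE cdr_plain);
CREATE INDEX ON cdr_btree USING btree (caller_id);
\copy cdr_btree from 'cdr.csv' csv

-- with a minmax index: only per-range summaries are maintained,
-- so COPY overhead should stay flat as the table grows
CREATE TABLE cdr_minmax (LIKE cdr_plain);
CREATE INDEX ON cdr_minmax USING minmax (call_timestamp);
\copy cdr_minmax from 'cdr.csv' csv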
Re: [HACKERS] Fast insertion indexes: why no developments
On 13 November 2013 11:54, Merlin Moncure wrote:
> On Wed, Nov 13, 2013 at 7:33 AM, Simon Riggs wrote:
>> On 13 November 2013 09:31, Leonardo Francalanci wrote:
>>> I would like to see some numbers.
>>
>> Alvaro has given me some results for his patch. The figures I have are
>> for a 2GB table.
>>
>> Index Build Time
>>   MinMax  11 s
>>   Btree   96 s
>>
>> Index Size
>>   MinMax  2 pages + metapage
>>   Btree   approx 200,000 pages + metapage
>>
>> Load time
>>   MinMax  no overhead, same as raw COPY
>>   BTree   considerably slower
>>
>> Index SELECT
>>   Slower for small groups of rows
>>   Broadly same for large requests (more results required on that
>>   assessment)
>>
>> I expect to publish results against TPC-H cases in the next few weeks.
>>
>> Additional tests are welcome for other use cases.
>
> Those are pretty exciting numbers. These days for analytics work I'm
> using mostly covering index type approaches.

If you're using index only scans then this will work for you as well,
hopefully better. Same principle wrt "all visible" page ranges.

> I bet the tiny index would more than offset the extra heap accesses.

That's the trade-off, yes. I was hoping that would lead to cases where the
min max is better than a btree, but not there yet.

> Can you CLUSTER against a minmax index?

Not in this release, at least in my understanding. It's not yet possible
to do an ordered fetch, so the cluster scan probably won't work. I was
hoping to include some special Freespace Map modes that would help there.

--
Simon Riggs  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
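For reference, the covering-index approach Merlin mentions looks like this
against the same hypothetical cdr table (pre-9.4, "covering" simply means
every referenced column is in the index key):

-- a btree that covers the query, enabling an index-only scan
CREATE INDEX cdr_cover ON cdr (caller_id, call_timestamp);

EXPLAIN (ANALYZE)
SELECT caller_id, call_timestamp
FROM cdr
WHERE caller_id = 12345;
-- "Index Only Scan" appears in the plan; heap fetches are skipped for
-- pages the visibility map marks all-visible, which is the same "all
-- visible" principle referred to above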
Re: [HACKERS] Fast insertion indexes: why no developments
On Wed, Nov 13, 2013 at 7:33 AM, Simon Riggs wrote:
> On 13 November 2013 09:31, Leonardo Francalanci wrote:
>> I would like to see some numbers.
>
> Alvaro has given me some results for his patch. The figures I have are
> for a 2GB table.
>
> Index Build Time
>   MinMax  11 s
>   Btree   96 s
>
> Index Size
>   MinMax  2 pages + metapage
>   Btree   approx 200,000 pages + metapage
>
> Load time
>   MinMax  no overhead, same as raw COPY
>   BTree   considerably slower
>
> Index SELECT
>   Slower for small groups of rows
>   Broadly same for large requests (more results required on that
>   assessment)
>
> I expect to publish results against TPC-H cases in the next few weeks.
>
> Additional tests are welcome for other use cases.

Those are pretty exciting numbers. These days for analytics work I'm using
mostly covering index type approaches. I bet the tiny index would more
than offset the extra heap accesses. Can you CLUSTER against a minmax
index?

merlin
Re: [HACKERS] Fast insertion indexes: why no developments
Simon Riggs wrote
> From our discussions here, IMHO there is a strong case for avoiding
> btrees completely for larger historical data tables. That isn't
> something I had even considered as desirable before this conversation
> but ISTM now that taking that approach will be more fruitful than
> attempting to implement LSM trees.

Eh? I don't understand this point. How can I avoid btrees, and still
search by caller_id? I don't get it...

Simon Riggs wrote
> Alvaro has given me some results for his patch. The figures I have are
> for a 2GB table.
>
> Index Build Time
>   MinMax  11 s
>   Btree   96 s
>
> Index Size
>   MinMax  2 pages + metapage
>   Btree   approx 200,000 pages + metapage
>
> Load time
>   MinMax  no overhead, same as raw COPY
>   BTree   considerably slower

Great!!! This looks very promising. Were the indexed values sequential?
Re: [HACKERS] Fast insertion indexes: why no developments
On 13 November 2013 09:31, Leonardo Francalanci wrote:
> Or, in other words: what are you going to write in the minmax index
> documentation, "try and see if they work better for you"?

That seems like good advice to me. Bacon > Aristotle. There is very little
written about the suitability of any index type for Postgres. I'm sure the
docs could be improved there.

> Plus, I'm *very* interested in the minmax index

Good, thanks.

> But, at the same time, I don't see any evidence that they work
> better than btrees (except for the size of the index).

Min max indexes are not designed to be a better btree; they are designed
to be roughly the same as automatic partitioning. They offer considerably
improved time to build, significantly reduced index size and significantly
improved insert performance. Min max will be clearly slower than btrees
for small numbers of records, though for large numbers of records we may
expect min max to perform the same as btrees, though that requires better
testing to get a more accurate picture. There may yet be optimisations of
the patch also.

Based on what we have discussed here, we've come up with two new
optimisations that can be used with MinMax, namely the bulk DELETE case
and the Merge Append case. Thank you for highlighting additional cases and
requirements.

From our discussions here, IMHO there is a strong case for avoiding btrees
completely for larger historical data tables. That isn't something I had
even considered as desirable before this conversation but ISTM now that
taking that approach will be more fruitful than attempting to implement
LSM trees.

Having said that, I am also in favour of declarative partitioning, just
that there is no funding available to work on that at present.

Further work on bitmap indexes is expected. They are already designed with
good insert performance in mind and this discussion re-emphasises that
requirement.

> I would like to see some numbers.

Alvaro has given me some results for his patch. The figures I have are for
a 2GB table.

Index Build Time
  MinMax  11 s
  Btree   96 s

Index Size
  MinMax  2 pages + metapage
  Btree   approx 200,000 pages + metapage

Load time
  MinMax  no overhead, same as raw COPY
  BTree   considerably slower

Index SELECT
  Slower for small groups of rows
  Broadly same for large requests (more results required on that
  assessment)

I expect to publish results against TPC-H cases in the next few weeks.

Additional tests are welcome for other use cases.

--
Simon Riggs  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
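The bulk DELETE case mentioned above is easy to picture in SQL (the schema
is hypothetical; the pruning behaviour described in the comment is the
design goal being discussed, not a measured result):

-- typical retention job on an append-mostly historical table
DELETE FROM cdr
WHERE call_timestamp < now() - interval '20 days';
-- with a min max index on call_timestamp, only page ranges whose
-- [min, max] summary overlaps the cutoff need to be visited; a btree
-- instead pays per-tuple index maintenance for every deleted row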
Re: [HACKERS] Fast insertion indexes: why no developments
Jeremy Harris wrote
> Surely there's good correlation between IMSI & IMEI, so have a separate
> table to translate one to (a group of) the others, and halve the indexes
> on your main table?

Yes; unfortunately both are not always available, but it's something we
are thinking about (it requires logic in the "inserting application" that
at the moment doesn't exist, but it is something that we'll have to add
sooner or later). In the end, though, yes: trying to use fewer indexed
fields is a good path.
Re: [HACKERS] Fast insertion indexes: why no developments
Simon Riggs wrote
> So in the use case you describe, the min max index would require a scan
> of only 25% of the table, not the 80% described earlier for random
> inserts. In my experience, people wish to keep data for much longer
> periods and so the percentage of scan required would drop lower than
> 25%, possibly to 5% or less for many applications.
>
> The plan would use sequential I/O so could still work quickly; given
> the low read rate, longer query times could be acceptable.

"Quickly"??? Seq-scanning 25% of a multi-TB table is not "quick".

Simon Riggs wrote
> Minmax indexes are simply a way to make this use case happen
> automatically, without the need for manual partitioning of the table.

Your logic assumes that we don't index anything but the call_timestamp.
That would lead to huge query response times. Plus: a btree doesn't have a
big problem keeping up in a sequential insertion scenario (such as
call_timestamp). So I still don't see the gain in using the minmax
indexes: again, could you point me to some performance tests of *any* use
case?

Simon Riggs wrote
> They are not the answer to every prayer, but with respect they are
> better than you had claimed they would be. (25% not 80%, in your use
> case). I saw this was likely to be the case and this is why I
> challenged you to describe in more detail. Thank you.

I claimed they would scan 80% of the table because I assumed I had to use
them on the random fields, not on the call_timestamp field. I don't need a
better index on call_timestamp, because it's sequential; I don't have
problems with that. But it's useless: I don't want to seq-scan 25% of a
multi-TB table.

Simon Riggs wrote
> Performance tests are only possible with a clear use case.

Well, so I can add my weird_index patch to the postgresql source code, and
it would be committed right away??? I assumed you had to prove somehow
that the new index was better than what is already available, at least for
some cases. Or, in other words: what are you going to write in the minmax
index documentation, "try and see if they work better for you"?

Simon Riggs wrote
> Please see that Alvaro and I have gone out of our way to provide a new
> facility to help you and others, and that it requires changing how we
> think about the solution. I accept it may not provide for every case
> but it requires careful analysis before deciding that is so.

If I came out too rough, I ask your pardon. I always appreciate people
taking their time to help someone else for free. Plus, I'm *very*
interested in the minmax index, especially for call_timestamp (some
queries are date-range only, such as "give me all the calls in this
particular 2-second range") or the id column I have. But, at the same
time, I don't see any evidence that they work better than btrees (except
for the size of the index). I would like to see some numbers.

I worked a little on the bitmap index implementation, and I stopped
because on the large tables these indexes are supposed to be used on, the
heap lookup took so much time that the (slightly) faster index access
didn't really help, since it was a fraction of the query time... I'm
afraid it would be the same with minmax indexes; that's why I wanted to
see some numbers...

Simon Riggs wrote
>> And, again, I think that random values insertion is the worst
>> use case for minmax indexes.
>
> I agree, it is. What we have disagreed on is the extent to which that
> scenario exists for use cases on very large tables, which are typically
> "append-mostly" with most queries accessing a subset of the table, e.g.
> date range.

Mhh... maybe this is the point where we don't understand each other? I
query the table by userid + date range. The date range is *not* selective
enough (it's, as you said, 25% of the multi-TB table). The userid is
selective *a lot*. I'm looking for a "better" index for the userid
column(s). The "new" indexes I mentioned in the OP claim they are better
in this scenario (but I don't blindly believe them). I don't see how
indexing the call_timestamp only could be of any interest, since it would
require seq-scanning 25% of a huge table for every query.
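The query shape being argued about, written out against a hypothetical cdr
schema (names and constants invented for illustration):

SELECT *
FROM cdr
WHERE caller_imsi = 123456789                       -- highly selective
  AND call_timestamp >= now() - interval '10 days'
  AND call_timestamp <  now() - interval '5 days';  -- ~25% of the table

-- a minmax index on call_timestamp alone can only prune the scan down
-- to that ~25%; making the caller_imsi predicate fast still needs some
-- index on the random-valued column, which is where the btree insertion
-- cost comes back in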
Re: [HACKERS] Fast insertion indexes: why no developments
On 13/11/13 09:07, Leonardo Francalanci wrote:
> Problem? Having 4 btree indexes on random values (imsi+imei * 2, since
> we have calling and caller) kills the insertion performance after a
> while.

Surely there's good correlation between IMSI & IMEI, so have a separate
table to translate one to (a group of) the others, and halve the indexes
on your main table?

--
Jeremy
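A sketch of that translation-table idea (all table, column, and constant
values here are hypothetical):

-- one row per known IMSI/IMEI pairing, maintained by the inserting app
CREATE TABLE imsi_imei_map (
    imsi bigint NOT NULL,
    imei bigint NOT NULL,
    PRIMARY KEY (imsi, imei)
);

-- the main table then only indexes the two IMSI columns...
CREATE INDEX ON cdr (calling_imsi);
CREATE INDEX ON cdr (called_imsi);

-- ...and an IMEI search first resolves to the (small) group of IMSIs
SELECT c.*
FROM imsi_imei_map m
JOIN cdr c ON c.calling_imsi = m.imsi
WHERE m.imei = 359881001234567;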
Re: [HACKERS] Fast insertion indexes: why no developments
On 13 November 2013 06:07, Leonardo Francalanci wrote:
> The use case is pretty simple. Think of it as the NSA, as it would be
> much easier. Collect all calls made/received by any user on a mobile
> network (in fact, it's something more than calls, so it is really 2-10
> rows per call). Keep the data for 20 days. That's the insert part.
>
> Query: search calls made/received by the user using IMSI (caller id) or
> IMEI (phone id). Date range is usually days (past 4 days, from 10 days
> ago to 5 days ago...)
>
> The result is just a very small percentage of the rows present in the
> table: a single user doesn't call that much! Searches are made by a
> human, so not that many requests per second.
>
> It's not a "write mostly" scenario, it's a 99% write 1% read scenario.
>
> Problem? Having 4 btree indexes on random values (imsi+imei * 2, since
> we have calling and caller) kills the insertion performance after a
> while. Solution so far? Partition every 15 minutes, create the indexes
> in bulk.

So in the use case you describe, the min max index would require a scan of
only 25% of the table, not the 80% described earlier for random inserts.
In my experience, people wish to keep data for much longer periods and so
the percentage of scan required would drop lower than 25%, possibly to 5%
or less for many applications.

The plan would use sequential I/O so could still work quickly; given the
low read rate, longer query times could be acceptable.

Minmax indexes are simply a way to make this use case happen
automatically, without the need for manual partitioning of the table.

They are not the answer to every prayer, but with respect they are better
than you had claimed they would be (25%, not 80%, in your use case). I saw
this was likely to be the case and this is why I challenged you to
describe in more detail. Thank you.

> Simon Riggs wrote
>> If we have a query to show the most recent calls by a particular caller
>>
>> SELECT *
>> FROM cdr
>> WHERE callerid = X
>> ORDER BY call_timestamp DESC
>> LIMIT 100
>>
>> then this could potentially be optimised using a minmax index, by
>> traversing the data ranges in call_timestamp order. That is not part
>> of the code in this initial release, since the main use case is for
>> WHERE call_timestamp >= X, or WHERE primarykey = Y
>
> I don't understand how an index on call_timestamp would help
> in the query above.

The min max index would cover call_timestamp and would be used to optimise
the query. That is not in the current version; it is a later optimisation
that I think is possible after considering your use case and similar ones.
This is a similar optimisation to the Merge Append case for partitioned
tables.

> Simon Riggs wrote
>> I don't believe there is a credible business case for running that
>> same query but without the ORDER BY and LIMIT, since it could
>> potentially return gazillions of rows
>
> Gazillions of rows??? We're talking about calls made/received by
> one user here. How many calls do you make in 10 days???
>
> Simon Riggs wrote
>> so it isn't surprising at all
>> that it would access a large % of the table.
>
> In fact, the query I use returns a fraction of the table, and only
> a very small amount of users get searched.
>
> Simon, you keep on talking about these minmax indexes, and
> I still don't see any reference to some performance tests.

Performance tests are only possible with a clear use case. It would be
helpful if you could benchmark your own use case, and I request others do
the same.

I have and will challenge people that simply assert this new type of index
is not useful based on a generic argument, with no use case and no tests.
That is nothing personal; it is simply that I do not wish
misunderstandings to block the adoption of new features.

Please see that Alvaro and I have gone out of our way to provide a new
facility to help you and others, and that it requires changing how we
think about the solution. I accept it may not provide for every case but
it requires careful analysis before deciding that is so.

> And, again, I think that random values insertion is the worst
> use case for minmax indexes.

I agree, it is. What we have disagreed on is the extent to which that
scenario exists for use cases on very large tables, which are typically
"append-mostly" with most queries accessing a subset of the table, e.g.
date range.

--
Simon Riggs  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
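For context, the manual workaround described in the use case (partition
every 15 minutes, create the indexes in bulk) looks roughly like this with
9.x-era inheritance partitioning; all names and timestamps are invented:

-- empty partition created every 15 minutes, loaded with no indexes
CREATE TABLE cdr_20131113_1045 (
    CHECK (call_timestamp >= '2013-11-13 10:45'
       AND call_timestamp <  '2013-11-13 11:00')
) INHERITS (cdr);

-- ... bulk COPY the 15 minutes of traffic into it ...

-- then build the four btrees once, against a small fully loaded table,
-- instead of paying random index insertions row by row:
CREATE INDEX ON cdr_20131113_1045 (calling_imsi);
CREATE INDEX ON cdr_20131113_1045 (called_imsi);
CREATE INDEX ON cdr_20131113_1045 (calling_imei);
CREATE INDEX ON cdr_20131113_1045 (called_imei);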
Re: [HACKERS] Fast insertion indexes: why no developments
Simon Riggs wrote
> On 5 November 2013 14:28, Leonardo Francalanci <m_lists@...> wrote:
>
>> Either my sql is not correct (likely), or my understanding of the
>> minmax index is not correct (even more likely), or the minmax index is
>> not usable in a random inputs scenario.
>
> Please show the real world SQL you intend to run, so we can comment on
> it. Inventing a use case that breaks the effectiveness of any
> optimisation is always easy, but the question is whether the use case
> is likely or even desirable.

The use case is pretty simple. Think of it as the NSA, as it would be much
easier. Collect all calls made/received by any user on a mobile network
(in fact, it's something more than calls, so it is really 2-10 rows per
call). Keep the data for 20 days. That's the insert part.

Query: search calls made/received by the user using IMSI (caller id) or
IMEI (phone id). Date range is usually days (past 4 days, from 10 days ago
to 5 days ago...)

The result is just a very small percentage of the rows present in the
table: a single user doesn't call that much! Searches are made by a human,
so not that many requests per second.

It's not a "write mostly" scenario, it's a 99% write 1% read scenario.

Problem? Having 4 btree indexes on random values (imsi+imei * 2, since we
have calling and caller) kills the insertion performance after a while.
Solution so far? Partition every 15 minutes, create the indexes in bulk.

Simon Riggs wrote
> If we have a query to show the most recent calls by a particular caller
>
> SELECT *
> FROM cdr
> WHERE callerid = X
> ORDER BY call_timestamp DESC
> LIMIT 100
>
> then this could potentially be optimised using a minmax index, by
> traversing the data ranges in call_timestamp order. That is not part
> of the code in this initial release, since the main use case is for
> WHERE call_timestamp >= X, or WHERE primarykey = Y

I don't understand how an index on call_timestamp would help in the query
above.

Simon Riggs wrote
> I don't believe there is a credible business case for running that
> same query but without the ORDER BY and LIMIT, since it could
> potentially return gazillions of rows

Gazillions of rows??? We're talking about calls made/received by one user
here. How many calls do you make in 10 days???

Simon Riggs wrote
> so it isn't surprising at all
> that it would access a large % of the table.

In fact, the query I use returns a fraction of the table, and only a very
small amount of users get searched.

Simon, you keep on talking about these minmax indexes, and I still don't
see any reference to some performance tests. And, again, I think that
random values insertion is the worst use case for minmax indexes.
Re: [HACKERS] Fast insertion indexes: why no developments
2013/11/12 Simon Riggs:
> On 12 November 2013 21:41, Nicolas Barbier wrote:
>
>> Look-up speed is as follows: Each look-up must look through all
>> B-trees.
>
> That can be optimised by using a min max approach, so we need only
> look at sub-trees that may contain data.

Under the assumption that the data are really *really* inserted in random
order, I think that min max indexes would not help much: All subtrees
would contain so many different values that the boundaries will almost
never be narrow enough to exclude scanning them (unless one is searching
for outliers).

I think that min max indexes are very useful for data that is inserted in
a somewhat less-than-random way: When rather large swathes of inserted
data fall in between boundaries that a lot of other data doesn't fall in
between. For min max index scans to be fast, the data doesn't exactly have
to be ordered in the heap (that's one of the advantages vs. B-trees, where
a different order in the index and the heap is very bad for scans of any
significant part of the index), but it can also not be entirely arbitrary.

I would say that min max indexes should be used in either of the following
cases (and probably some more that I don't think of right now):

* When the data is inserted close to ordered (i.e., past or future dates
  that have a tendency to be correlated with "the current day," such as
  invoice dates). For this usecase, a normal B-tree would also work, but
  would take more space and require more heavy maintenance.
* When large batches of "similar" data (fitting in between boundaries that
  are more narrow than the "global boundaries") are inserted at a time,
  even when multiple of such batches arrive in random order.
* When insertion is up to entirely random, but the goal is to optimize
  look-ups for (relatively rare) outliers.
* (combined with any of the above to actually make the index *useful*)
  When needing a lot of indexes on the same data or having a very high
  rate of insertion, and therefore the maintenance burden matters a lot.

> I would add that it is possible to optimise large DELETEs from a table
> if complete sub-trees of the btree can be easily removed. This for me
> would be the compelling benefit of this approach.

Idem WRT the randomness: If the data are really deleted in a completely
random fashion (e.g., Leonardo might want to delete all phone calls in a
certain year, while the index is on phone number), whole subtrees would
almost never become candidates for deletion (the same problem applies to a
normal min max index). Of course, everything would change if the data is
not really deleted randomly. The same usecases as mentioned above (replace
"insertion" with "deletion") seem to apply for deletion in min max
indexes.

Note that B-forests as described before don't work well for a high (or
even not-so-high) rate of deletions: During VACUUM the updates to the
bigger trees would kill all performance, similar to how a high rate of
insertion kills the performance of normal B-trees once they get big. To
remedy this, one could add stuff such as "B-trees of deleted entries"
(i.e., negative trees) that may then afterwards be merged with other such
"negative" trees + "positive" trees. Look-ups would need to take all those
trees into account.

Nicolas

--
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?
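One cheap way to check which of those cases a given column falls into is
the physical-correlation statistic the planner already keeps (the table
must have been ANALYZEd; 'cdr' and the column names are hypothetical):

SELECT tablename, attname, correlation
FROM pg_stats
WHERE tablename = 'cdr'
  AND attname IN ('call_timestamp', 'caller_imsi');
-- correlation near +1 or -1: heap order tracks value order, so min max
-- boundaries are narrow and prune well; near 0: boundaries cover almost
-- the whole value range and prune almost nothing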
Re: [HACKERS] Fast insertion indexes: why no developments
On 12 November 2013 19:54, Claudio Freire wrote:
> On Tue, Nov 12, 2013 at 7:14 PM, Simon Riggs wrote:
>> I still think we need to look at this from a query perspective though.
>> We need to check whether there is a class of real world queries that
>> are not well optimised by minmax indexes, or cannot be made to be in
>> future releases. For example, large DELETEs from a table are almost
>> trivially optimised for min max.
>
> Only if you don't have a PK (or other index).

Right. Min max indexes are optimised for large DELETEs, btrees are not
(yet), which is what we are discussing.

I believe it remains to be shown that a btree is actually desirable on a
very big table. So far the discussion has just assumed this is the case,
without looking at specific SQL. It might be better to look at ways of
avoiding a btree altogether than to spend time optimising btrees for this
case.

Perhaps we can enforce a PK constraint without using a btree, if one is
required. This might be guaranteed by using a sequence or other mechanism.

--
Simon Riggs  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
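A sketch of the sequence idea: uniqueness by construction rather than by
index. Note that this enforces nothing against inserts that bypass the
default, so it is weaker than a real PK (names hypothetical):

CREATE SEQUENCE cdr_id_seq;
CREATE TABLE cdr (
    id bigint NOT NULL DEFAULT nextval('cdr_id_seq'),
    call_timestamp timestamptz NOT NULL
);
-- no PRIMARY KEY declared, hence no btree to maintain during bulk
-- loads; values are unique because every row draws from the sequence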
Re: [HACKERS] Fast insertion indexes: why no developments
2013/11/12 Claudio Freire:
> On Tue, Nov 12, 2013 at 6:41 PM, Nicolas Barbier wrote:
>
>> (Note that K B-trees can be merged by simply scanning all of them
>> concurrently, and merging them just like a merge sort merges runs.
>> Also, all B-trees except for the first level (of size S) can be
>> compacted 100% as there is no need to reserve space for further
>> insertions in them.)
>
> Unless you can guarantee strong correlation of index-order vs
> physical-order, scanning multiple indexes in index-order will be quite
> slow (random I/O).

As all the bigger trees are written in one pass (as the result of a merge
of multiple smaller trees), I would assume that it is rather easy to
guarantee it for them. As for the smallest trees (size S), I think it
doesn't matter much as they "fit easily in memory."

Initially I would say that redefining it so that K of them (rather than 1)
must still fit in memory is the easy fix. A future optimization could
alleviate the need for the redefinition (and would also improve normal
B-tree indexes): Somehow make sure that smaller trees (that fit in memory)
are typically written out more-or-less in the right order. For that, one
could for example postpone determining the ultimate block order until
write-out time. This is similar to what Reiser4 calls "dancing trees," but
unfortunately requires some rejiggering of the abstraction layers in
PostgreSQL (I think). Having deferred insertion (which is probably way
easier to implement) could conceivably also improve things.

<https://en.wikipedia.org/wiki/Dancing_tree>

Nicolas

--
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?
Re: [HACKERS] Fast insertion indexes: why no developments
On Tue, Nov 12, 2013 at 6:41 PM, Nicolas Barbier wrote:
> (Note that K B-trees can be merged by simply scanning all of them
> concurrently, and merging them just like a merge sort merges runs.
> Also, all B-trees except for the first level (of size S) can be
> compacted 100% as there is no need to reserve space for further
> insertions in them.)

Unless you can guarantee strong correlation of index-order vs
physical-order, scanning multiple indexes in index-order will be quite
slow (random I/O).

On Tue, Nov 12, 2013 at 7:14 PM, Simon Riggs wrote:
> I still think we need to look at this from a query perspective though.
> We need to check whether there is a class of real world queries that
> are not well optimised by minmax indexes, or cannot be made to be in
> future releases. For example, large DELETEs from a table are almost
> trivially optimised for min max.

Only if you don't have a PK (or other index).
Re: [HACKERS] Fast insertion indexes: why no developments
On 12 November 2013 21:41, Nicolas Barbier wrote:
> Look-up speed is as follows: Each look-up must look through all
> B-trees.

That can be optimised by using a min max approach, so we need only look at
sub-trees that may contain data.

> Index size: I think (didn't calculate) that the combined size of the
> B-trees will be about the same as (a little bit more than) the size of
> a single big B-tree containing the same entries.

Agreed

> Major missing piece in PostgreSQL (I think):
>
> * Functionality to merge K indexes into one (bigger) combined index.

Good analysis. I would add that it is possible to optimise large DELETEs
from a table if complete sub-trees of the btree can be easily removed.
This for me would be the compelling benefit of this approach.

I still think we need to look at this from a query perspective though. We
need to check whether there is a class of real world queries that are not
well optimised by minmax indexes, or cannot be made to be in future
releases. For example, large DELETEs from a table are almost trivially
optimised for min max.

--
Simon Riggs  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Fast insertion indexes: why no developments
2013/11/12 Nicolas Barbier:
> In conclusion, use a "B-forest" when:
>
> * The index entries are small (large fan-out).
> * The insertion throughput is high.
> * It's OK for look-ups to be slow.
> * Extra points when the storage medium has high seek times.

Oops, forgot the most important ones:

* Insertions are random.
* The total amount of data is very high.

Nicolas

--
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?
Re: [HACKERS] Fast insertion indexes: why no developments
2013/11/2 Simon Riggs:
> On 29 October 2013 16:10, Peter Geoghegan wrote:
>
>> Presumably someone will get around to implementing a btree index
>> insertion buffer one day. I think that would be a particularly
>> compelling optimization for us, because we could avoid ever inserting
>> index tuples that are already dead when the deferred insertion
>> actually occurs.
>
> That's pretty much what the LSM-tree is.

[ Disclaimer: I have only skimmed the LSM-trees paper rather than read it
thoroughly. ]

LSM-trees seem to hit a wall when the total amount of data gets big
enough, unless one uses "multi-component LSM-trees" (as described in the
paper). Having a B-tree with deferred insertion would be similar to having
an LSM-tree without the multi-component property.

The underlying problem with fast random insertions into a B-tree is that
each write of a whole block writes only a small amount of "useful changes"
(once the tree gets large enough relative to memory). The
"multi-component" concept fixes that. I think that simply combining that
concept with normal B-trees is a rather elegant way of (at least
theoretically) solving the problem:

Definitions:

* Define a rather small integer K (with K >= 2), that will influence
  maximum insertion speed (higher K = higher insertion speed), but also
  look-up time (higher K = slower look-up).
* Define some size S "that easily fits in memory."
* F is the fan-out of the B-tree. (If F < K, the algorithm results in
  slower amortized insertion speed than simply using one single big
  B-tree if only the number of blocks read/written is taken into account;
  it may still be better because of seek times.)
* n is the total number of entries in the index.

Algorithm:

* Start inserting into a small B-tree until it reaches size S, then start
  inserting into a new B-tree until that fills up to size S, etc.
* When K such B-trees (of size S) exist, merge them together into one
  (S × K)-sized B-tree (delete the old ones).
* Do this recursively: Once K B-trees of size (K × S) exist, merge them
  together into one (S × K^2)-sized B-tree, etc.
* Let each look-up walk all trees, and return the union of all results.

(Note that K B-trees can be merged by simply scanning all of them
concurrently, and merging them just like a merge sort merges runs. Also,
all B-trees except for the first level (of size S) can be compacted 100%
as there is no need to reserve space for further insertions in them.)

Insertion speed can be calculated as follows (I disregard seek times to
make this easy; it also happens to be the right assumption for SSDs):
Assume that a small B-tree (size S) can simply be written out without
having to write each block multiple times. Each entry has to be written
log_K(n × log_F(n)) times. All blocks written are 100% "useful changes."
Therefore, insertion speed is log_K(n × log_F(n)) times less than raw disk
speed. (Note that I also disregard the time needed for reading; this may
make everything about twice as slow.)

Example: For K = 10, F = 100 (i.e., 80 B per entry), blocksize = 8 kB, and
n = 10^9 (i.e., 80 GB of entries), the insertion speed is
log_10(10^9 × log_100(10^9)) = log_10(10^9 × 4.5) = ~9.5 times slower than
writing the 80 GB of index entries sequentially. For the "one single big
B-tree" scenario, the insertion speed would be ~F = ~100 times slower than
raw disk speed (assuming that all writes but the ones to the leaf level
can be combined).

Look-up speed is as follows: Each look-up must look through all B-trees.
Assume for simplicity that all trees have the same height as the single
B-tree in the "one single big B-tree" case (which is rather wrong (most
are less tall), but seems good enough as an estimate), there are K trees
of each size (idem), and there are log_K(n) different sizes. Then, the
number of trees to walk is K × log_K(n), and therefore each look-up is
K × log_K(n) times slower than in the "one single big B-tree" case.

Example: (using the same parameters as above) Look-up speed is
10 × log_10(10^9) = 90 times slower (i.e., because there are ~90 trees).

Index size: I think (didn't calculate) that the combined size of the
B-trees will be about the same as (a little bit more than) the size of a
single big B-tree containing the same entries.

I have no idea whether someone already gave this kind of "forest of
B-trees" structure a name. Otherwise, I suggest "B-forest" :-).

In conclusion, use a "B-forest" when:

* The index entries are small (large fan-out).
* The insertion throughput is high.
* It's OK for look-ups to be slow.
* Extra points when the storage medium has high seek times.

Major missing piece in PostgreSQL (I think):

* Functionality to merge K indexes into one (bigger) combined index.

Nicolas

--
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?
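The K-way merge step has a close SQL analogue: it is the same operation
the planner performs as a Merge Append. A rough illustration with three
stand-in tables, each btree-indexed on x (all names hypothetical):

CREATE INDEX ON part_1 (x);
CREATE INDEX ON part_2 (x);
CREATE INDEX ON part_3 (x);

SELECT x FROM part_1
UNION ALL
SELECT x FROM part_2
UNION ALL
SELECT x FROM part_3
ORDER BY x;
-- planners since 9.1 can satisfy the ORDER BY by merging the three
-- ordered index scans, just like a merge sort merges runs; a "merge K
-- indexes into one" utility would do the same scan and write the result
-- out as one compact btree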
Re: [HACKERS] Fast insertion indexes: why no developments
On 5 November 2013 14:28, Leonardo Francalanci wrote:
> Either my sql is not correct (likely), or my understanding of the
> minmax index is not correct (even more likely), or the minmax index is
> not usable in a random inputs scenario.

Please show the real world SQL you intend to run, so we can comment on it.
Inventing a use case that breaks the effectiveness of any optimisation is
always easy, but the question is whether the use case is likely or even
desirable.

If we have a query to show the most recent calls by a particular caller

SELECT *
FROM cdr
WHERE callerid = X
ORDER BY call_timestamp DESC
LIMIT 100

then this could potentially be optimised using a minmax index, by
traversing the data ranges in call_timestamp order. That is not part of
the code in this initial release, since the main use case is for
WHERE call_timestamp >= X, or WHERE primarykey = Y.

I don't believe there is a credible business case for running that same
query but without the ORDER BY and LIMIT, since it could potentially
return gazillions of rows, so it isn't surprising at all that it would
access a large % of the table. Saying "but I really do want to run it"
isn't an argument in favour of it being a sensible query to run; we are
only interested in optimising sensible real world queries.

--
Simon Riggs  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Fast insertion indexes: why no developments
On Thu, Oct 31, 2013 at 12:43 AM, Leonardo Francalanci wrote:
> Jeff Janes wrote
>> True, but that is also true of indexes created in bulk. It all has to
>> reach disk eventually--
>> [...]
>> If the checkpoint interval is as long as the partitioning period, then
>> hopefully the active index buffers get re-dirtied while protected in
>> shared_buffers, and only get written to disk once.
>
> Honestly, I made a lot of tests in the past, and I don't remember if I
> tried 15-minute checkpoints + high shared_buffers. That might work. I'm
> going to try it and see what happens.

You might want to go even beyond 15 minutes.

> Jeff Janes wrote
>> If the buffers get read, dirtied, and evicted from a small
>> shared_buffers over and over again then you are almost guaranteed that
>> will get written to disk multiple times
>
> (as I understand, but I might be wrong): high shared_buffers don't help
> because in such a random index writing, lots and lots of pages get
> dirtied, even if the change in the page was minimal. So, in the
> "15-minute" period, you write the same pages over and over again.

But you write them only if you need to, due to a checkpoint, needing new
buffers to read in something else that is not already in shared_buffers,
or because you are using a buffer-access strategy that uses a ring. If you
make checkpoints long, it will cut down on the first. If shared_buffers is
large enough to contain the active part of the indexes being updated, that
should cut down on the second. I don't know if the third is a problem or
not; I think COPY might try to use a ring buffer, but I don't know if it
does that for indexed tables.

> Even if you have high shared_buffers, the same page will get synced to
> disk multiple times (at every checkpoint).

If the active part of the indexes is much larger than you could possibly
set shared_buffers to, then there is probably little point in increasing
shared_buffers from, say, 1% of the active index size to 8% of it. It only
makes sense to increase it if you can do so large enough to cover ~100% of
the needed space.

> The idea of those "other" indexes is to avoid the random writing,
> maximizing the writing in sequence, even if that means writing more
> bytes. In other words: writing a full 8KB is no different than writing
> 20 bytes in a page, as we'll have to sync the whole page anyway...

True, but that is the idea here as well. If you can delay writing the page
until 20 bytes of it have been dirtied on 400 different occasions...

I'm not saying we shouldn't think about some kind of insert buffer, but I
really doubt that that is going to happen in 9.4, while increasing
shared_buffers can be done today, if it works and if you can live with the
consequences.

Cheers,

Jeff
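What "long checkpoints + high shared_buffers" means concretely, as
9.3-era postgresql.conf settings (the values are illustrative assumptions,
not recommendations from the thread):

# stretch the checkpoint cycle so hot index pages are re-dirtied in
# shared_buffers instead of being written out once per dirtying
checkpoint_timeout = 30min          # "even beyond 15 minutes"
checkpoint_segments = 256           # pre-9.5 setting; keeps WAL volume
                                    # from forcing earlier checkpoints
checkpoint_completion_target = 0.9

# large enough to hold the active part of the indexes being updated
shared_buffers = 16GB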
Re: [HACKERS] Fast insertion indexes: why no developments
On Tue, Nov 5, 2013 at 9:52 AM, Leonardo Francalanci wrote:
> Jeff Janes wrote
>> Some experiments I did a few years ago showed that applying sorts to
>> the data to be inserted could be helpful even when the sort batch size
>> was as small as one tuple per 5 pages of existing index. Maybe even
>> less.
>
> Cool!!! Do you have any idea/hint on how I could try and replicate
> that? Do you remember how you did it?

I can't find my notes but I remember more or less how I did it.

Since we don't yet have an insertion buffer that allows the rows to be
sorted in different order for different indexes, I had to simulate it just
by using a table with a single index and hoping that that would
extrapolate.

create table foo (x bigint);

To speed things up, you may want to prepopulate this with random data so
that the size of the index-to-be will exceed shared_buffers, or physical
RAM, before making the index. Also, the effectiveness might depend on how
much the index has grown since its creation, since leaf pages are
initially correlated between physical order and logical order, but that
decreases over time. So you may want to try different initial seed sizes.

create index on foo (x);

Then I use perl to make run-sorted data with different run sizes, and load
that via \copy. I put all the data points in memory up front rather than
generating them per-run on the fly, so that perl consumes about the same
amount of memory regardless of the run size. You would want to use more
than 1..1e6 if you are on a very large RAM machine. Something like:

for run_size in 1 10 100 1000 10000 100000; do
  perl -le 'my @x; push @x, int(rand()*1e8) foreach 1..1e6; while (@x) {print foreach sort {$a<=>$b} splice @x,0,'$run_size'; }' | time psql -c '\copy foo from stdin'
done

But you probably want another inner loop so that the \copy gets executed
multiple times per run_size, so that each run_size executes for at least a
couple checkpoint cycles.

Cheers,

Jeff
Re: [HACKERS] Fast insertion indexes: why no developments
Greg Stark wrote:
> I think minmax indexes are more akin to bitmap indexes. They will be
> very effective for columns with low-cardinality, especially for columns
> that are very clustered. In the extreme if all the values in some
> regions of the table are the same then minmax indexes would be optimal.
> I wouldn't expect them to be very effective for a highly selective
> column that isn't well clustered.

I think clustering is more important than cardinality. I mean, you can
have a very useful minmax index on a float column, on which maybe there
are no two identical values.

I certainly don't think minmax indexes will solve all indexing problems.
In the end, they're just one more tool in your DBA toolbox.

--
Álvaro Herrera  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Fast insertion indexes: why no developments
On Tue, Nov 5, 2013 at 5:30 PM, Claudio Freire wrote:
>> Maybe there's value in minmax indexes for sequential data, but not for
>> random data, which is the topic of this thread.
>
> Well, of course, they're not magic pixie dust.
>
> But is your data really random? (or normal?)

I think minmax indexes are more akin to bitmap indexes. They will be very
effective for columns with low-cardinality, especially for columns that
are very clustered. In the extreme if all the values in some regions of
the table are the same then minmax indexes would be optimal. I wouldn't
expect them to be very effective for a highly selective column that isn't
well clustered.

It really sounds like you're describing a particular workload that btrees
could just be more optimized for. Buffering all inserts in memory and
merging them into the btree lazily is actually something Heikki has
proposed in the past. I'm not clear if that gets you all the benefits of
the indexes you described or not but it seems to target the particular
problem you're having.

--
greg
Re: [HACKERS] Fast insertion indexes: why no developments
On Tue, Nov 5, 2013 at 2:52 PM, Leonardo Francalanci wrote:
> Jeff Janes wrote
>> Some experiments I did a few years ago showed that applying sorts to
>> the data to be inserted could be helpful even when the sort batch size
>> was as small as one tuple per 5 pages of existing index. Maybe even
>> less.
>
> Cool!!! Do you have any idea/hint on how I could try and replicate
> that? Do you remember how you did it?

I do it regularly by sorting tuples before inserting/updating. It helps
quite significantly for batches of ~1000 tuples (well, in my case).
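In practice that is as simple as adding an ORDER BY on the indexed column
when loading a batch (a minimal sketch; staging_batch and caller_id are
hypothetical, and the fact that rows are inserted in the order the SELECT
produces them is an implementation detail, not a guarantee):

-- arrival order: index pages are touched at random
INSERT INTO cdr SELECT * FROM staging_batch;

-- sorted: consecutive rows hit the same or adjacent index pages,
-- so far fewer index pages are dirtied per batch
INSERT INTO cdr SELECT * FROM staging_batch ORDER BY caller_id;

-- with several indexes you can only pre-sort for one of them per load,
-- which is why a per-index insertion buffer keeps coming up in the
-- thread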
Re: [HACKERS] Fast insertion indexes: why no developments
Jeff Janes wrote
> Some experiments I did a few years ago showed that applying sorts to
> the data to be inserted could be helpful even when the sort batch size
> was as small as one tuple per 5 pages of existing index. Maybe even
> less.

Cool!!! Do you have any idea/hint on how I could try and replicate that?
Do you remember how you did it?
Re: [HACKERS] Fast insertion indexes: why no developments
Claudio Freire wrote
> Well, of course, they're not magic pixie dust.

Of course they aren't. I think they can make a difference in a sequential
input scenario. But I'm not the one who said that they are fit to solve
the problems me and other people are talking about in this thread.

Claudio Freire wrote
> But is your data really random? (or normal?)
> That's the thing...

I still don't understand. Even if the data was normal, it wouldn't work.
You can try and change the 3rd parameter in normal_rand and get a
"narrower" distribution, but at the same time the query values should be
narrower... so you'll get the same results. The problem is: if the range
you get between min and max is too big in each page range, you'll end up
scanning a lot of heap pages.

To me, "the thing" is: has anyone made performance tests of these minmax
indexes with an input that is not sequential? (BTW I'd like to see tests
for the sequential input scenario too...)
Re: [HACKERS] Fast insertion indexes: why no developments
On Tue, Nov 5, 2013 at 12:22 PM, Leonardo Francalanci wrote:
> Claudio Freire wrote
>> you haven't really analyzed update cost, which is what we were talking
>> about in that last post.
>
> I don't care for a better update cost if the cost to query is a table
> scan. Otherwise, I'll just claim that no index at all is even better
> than minmax: 0 update cost, pretty much same query time.
>
> Maybe there's value in minmax indexes for sequential data, but not for
> random data, which is the topic of this thread.

Well, of course, they're not magic pixie dust.

But is your data really random? (or normal?)

That's the thing...
Re: [HACKERS] Fast insertion indexes: why no developments
On Tue, Nov 5, 2013 at 12:25 AM, Leonardo Francalanci wrote:
> Andres Freund wrote
>> On 2013-11-04 11:27:33 -0500, Robert Haas wrote:
>>> On Mon, Nov 4, 2013 at 11:24 AM, Claudio Freire <klaussfreire@...>
>>> wrote:
>>>> Such a thing would help COPY, so maybe it's worth a look
>>>
>>> I have little doubt that a deferred insertion buffer of some kind
>>> could help performance on some workloads, though I suspect the buffer
>>> would have to be pretty big to make it worthwhile on a big COPY that
>>> generates mostly-random insertions.
>>
>> Even for random data presorting the to-be-inserted data appropriately
>> could result in much better io patterns.
>
> Mmh, I'm afraid that the buffer would have to be huge to get some real
> advantage. You have to buffer enough values to avoid "touching" entire
> pages, which is not that easy.

Some experiments I did a few years ago showed that applying sorts to the
data to be inserted could be helpful even when the sort batch size was as
small as one tuple per 5 pages of existing index. Maybe even less.

Cheers,

Jeff
Re: [HACKERS] Fast insertion indexes: why no developments
Claudio Freire wrote
> real data isn't truly random

Well, let's try normal_rand???

create table t1 as
  select trunc(normal_rand(1000000, 500, 300)) as n,
         generate_series(1, 1000000) as i;

with cte as (select min(n) as minn, max(n) as maxn, i/100 as r
             from t1
             group by i/100),
     inp as (select generate_series(1, 50) as iinp,
                    trunc(normal_rand(50, 500, 300)) as s)
select count(*), iinp
from inp left outer join cte on inp.s between minn and maxn
group by iinp;

Not that much different in my run...

Claudio Freire wrote
> you haven't really analyzed update cost, which is what we were talking
> about in that last post.

I don't care for a better update cost if the cost to query is a table
scan. Otherwise, I'll just claim that no index at all is even better than
minmax: 0 update cost, pretty much same query time.

Maybe there's value in minmax indexes for sequential data, but not for
random data, which is the topic of this thread.

BTW I would like to see some performance tests on the minmax indexes vs
btree for sequential inputs... is the gain worth it? I couldn't find any
mention of performance tests in the minmax threads.
Re: [HACKERS] Fast insertion indexes: why no developments
On Tue, Nov 5, 2013 at 11:28 AM, Leonardo Francalanci wrote:
> I get > 9000 pages for 49 values out of 50... which means scanning 90%
> of the table.
>
> Either my sql is not correct (likely), or my understanding of the
> minmax index is not correct (even more likely), or the minmax index is
> not usable in a random inputs scenario.

Yep, you're correct. That's the cost for querying random values.

But, both real data isn't truly random, and you haven't really analyzed
update cost, which is what we were talking about in that last post.
Re: [HACKERS] Fast insertion indexes: why no developments
Claudio Freire wrote
> Min-max indexes always require a sequential scan of the min-max index
> itself when querying.

I'm worried about the number of heap pages that will be scanned. My
understanding is that given the random input, the index will not be
selective enough, and will end up requiring a lot of page scanning to get
the relevant rows. That is: the "selectivity" of the min/max values will
be very low, given the high cardinality and randomness of the input; I'm
afraid that, in most "page ranges", the min will be lower than the queried
values, and the max higher, resulting in lots of heap page scans.

Quick test: a table with 1M rows, with random values from 1 to 1000:

create table t as
  select generate_series(1, 1000000) as i,
         trunc(random() * 1000) as n;

Suppose a page range contains 100 rows (but my understanding is that a
minmax index will use a much bigger row count...); let's find how many of
the resulting 10,000 "page ranges" would have to be scanned to find each
of a series of 50 values (again, random in the 1-1000 range):

with cte as (select min(n) as minn, max(n) as maxn, i/100 as r
             from t
             group by i/100),
     inp as (select generate_series(1, 50) as iinp,
                    trunc(random() * 1000) as s)
select count(*), iinp
from inp left outer join cte on inp.s between minn and maxn
group by iinp;

I get > 9000 page ranges for 49 values out of 50... which means scanning
90% of the table.

Either my sql is not correct (likely), or my understanding of the minmax
index is not correct (even more likely), or the minmax index is not usable
in a random inputs scenario.
Re: [HACKERS] Fast insertion indexes: why no developments
On Tue, Nov 5, 2013 at 6:57 AM, Leonardo Francalanci wrote:
> Simon Riggs wrote
>> Minmax indexes seem to surprise many people, so broad generalisations
>> aren't likely to be useful.
>>
>> I think the best thing to do is to publish some SQL requests that
>> demonstrate in detail what you are trying to achieve and test them
>> against minmax indexes. That way we can discuss what does work and
>> what doesn't work well enough yet.
>
> While I do believe in testing (since "In theory there is no difference
> between theory and practice. In practice there is"), I would like to
> know the "properties" of the minmax index before trying it. What is it
> supposed to be good at? What are the pros/cons? We can't ask all the
> users to just "try" the index and see if it works for them. As I said,
> my understanding is that it is very efficient (both in insertion and in
> searching) when data is somehow ordered in the table. But maybe I got
> it wrong...

Well, for one, random inserts (with random data) on a min-max index have
roughly a 1/N chance of requiring a write to disk, and an (N-1)/N chance
of being completely free (or maybe a read to verify that a write isn't
needed, but that'll probably hit shared buffers), where N is the number of
tuples per page. Per index page, that is. Of course, non-random workloads
are a different matter.

Min-max indexes always require a sequential scan of the min-max index
itself when querying. That works when you intend to query enough tuples to
make up the cost (that is, more tuples than M * N * random_cost /
seq_cost), where M is the number of pages in the index. Well, actually,
since they result in better io patterns as well, the tradeoff is probably
a little bit more tricky than that, in favor of min-max indexes. Min-max
indexes tend to be very compact, so M is usually low.
Re: [HACKERS] Fast insertion indexes: why no developments
Simon Riggs wrote
> On 5 November 2013 09:57, Leonardo Francalanci <m_lists@...> wrote:
>> While I do believe in testing (since "In theory there is no difference
>> between theory and practice. In practice there is"), I would like to know
>> the "properties" of the minmax index before trying it.
>> What is it supposed to be good at? What are the pros/cons? We can't ask
>> all the users to just "try" the index and see if it works for them.
>
> No, but then all users aren't suggesting we need a new index type, are
> they?
>
> I think it's reasonable for you to spend time checking whether what you
> require will be met, and if not, what precise query it doesn't help,
> so we can better design any future new index.

I don't understand the parallel between "We can't ask all the users to just try the index and see if it works for them" and "all users aren't suggesting we need a new index type".

Anyway: I'm not suggesting we need a new index type. Please read my first post: I'm asking for info, fearing that there's just a lot of marketing/hype around those indexes.

What do you mean by "spend time checking whether what you require will be met"? Met by what, minmax indexes?

Simon Riggs wrote
>> As I said, my understanding is that it is very efficient (both in
>> insertion and in searching) when data is somehow ordered in the table.
>> But maybe I got it wrong...
>
>> Anyway, the sql scenario is simple: a table with 4 bigint indexes; data
>> in the fields is a random bigint in the range 1-1000. Insertion is 5-10K
>> rows/second. One search every 1-5 seconds, by one of the indexed fields
>> (only equality, no range). There's also an index on a timestamp field,
>> but that's not random so it doesn't give that many problems (it's
>> actually the one where I wanted to try the minmax).
>
> Without meaning to pick on you, imprecise analysis yields unhelpful
> new features. The clearer we are about what we are trying to solve, the
> more likely we are to solve it well. 30 seconds of analysis on what is
> needed is not sufficient to justify an additional man year of
> development, especially if a man year of work has already been done
> and the testing of the latest feature is now at hand.

I've never said that my analysis justifies a man year of work. As I said, I'm actually not confident at all that even if we had those "cool" indexes they would work in my scenario (marketing aside, there's not that much data/tests out there on those indexes). I just wanted to know if the matter was discussed in the past / to get more info. At the same time, I'm reluctant to try a new index hoping it will work in my case just because it's new and a man year of work has already been done.

Again: what is this minmax index supposed to be good at? If it's indexing data in mostly-sequential order, it won't help me. From what I got (maybe I got it wrong?) the index stores min/max values of sequences of pages. In my case I guess that those min/max values would be close to 1 (min) / 1000 (max), because I insert data in random order. So any query will scan the entire table anyway. Am I wrong?

Simon Riggs wrote
> The requests from the indexed fields, are they ordered?

Mmmh, I don't think I understand the question... an operator searches for calls made by a user, so he searches in random order...

Simon Riggs wrote
> Are they limited? Or do you really want ALL calls? What is the tolerance
> on those?

I want all the calls made by the user. But there won't be many calls for 1 user. 
And most users will never be queried (as I said, one "calls by person" search every 1-5 seconds, so a very small percentage of calls will ever be queried/retrieved) -- View this message in context: http://postgresql.1045698.n5.nabble.com/Fast-insertion-indexes-why-no-developments-tp5776227p5776982.html Sent from the PostgreSQL - hackers mailing list archive at Nabble.com. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fast insertion indexes: why no developments
On 5 November 2013 09:57, Leonardo Francalanci wrote:
> Simon Riggs wrote
>> Minmax indexes seem to surprise many people, so broad generalisations
>> aren't likely to be useful.
>>
>> I think the best thing to do is to publish some SQL requests that
>> demonstrate in detail what you are trying to achieve and test them
>> against minmax indexes. That way we can discuss what does work and
>> what doesn't work well enough yet.
>
> While I do believe in testing (since "In theory there is no difference
> between theory and practice. In practice there is"), I would like to know
> the "properties" of the minmax index before trying it.
> What is it supposed to be good at? What are the pros/cons? We can't ask all
> the users to just "try" the index and see if it works for them.

No, but then all users aren't suggesting we need a new index type, are they?

I think it's reasonable for you to spend time checking whether what you require will be met, and if not, what precise query it doesn't help, so we can better design any future new index.

> As I said, my understanding is that it is very efficient (both in insertion
> and in searching) when data is somehow ordered in the table. But maybe I
> got it wrong...
> Anyway, the sql scenario is simple: a table with 4 bigint indexes; data in
> the fields is a random bigint in the range 1-1000. Insertion is 5-10K
> rows/second. One search every 1-5 seconds, by one of the indexed fields
> (only equality, no range). There's also an index on a timestamp field, but
> that's not random so it doesn't give that many problems (it's actually the
> one where I wanted to try the minmax).

Without meaning to pick on you, imprecise analysis yields unhelpful new features. The clearer we are about what we are trying to solve, the more likely we are to solve it well. 30 seconds of analysis on what is needed is not sufficient to justify an additional man year of development, especially if a man year of work has already been done and the testing of the latest feature is now at hand.

The requests from the indexed fields, are they ordered? Are they limited? Or do you really want ALL calls? What is the tolerance on those?

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fast insertion indexes: why no developments
Simon Riggs wrote
> Minmax indexes seem to surprise many people, so broad generalisations
> aren't likely to be useful.
>
> I think the best thing to do is to publish some SQL requests that
> demonstrate in detail what you are trying to achieve and test them
> against minmax indexes. That way we can discuss what does work and
> what doesn't work well enough yet.

While I do believe in testing (since "In theory there is no difference between theory and practice. In practice there is"), I would like to know the "properties" of the minmax index before trying it. What is it supposed to be good at? What are the pros/cons? We can't ask all the users to just "try" the index and see if it works for them.

As I said, my understanding is that it is very efficient (both in insertion and in searching) when data is somehow ordered in the table. But maybe I got it wrong...

Anyway, the sql scenario is simple: a table with 4 bigint indexes; data in the fields is a random bigint in the range 1-1000. Insertion is 5-10K rows/second. One search every 1-5 seconds, by one of the indexed fields (only equality, no range). There's also an index on a timestamp field, but that's not random so it doesn't give that many problems (it's actually the one where I wanted to try the minmax).

--
View this message in context: http://postgresql.1045698.n5.nabble.com/Fast-insertion-indexes-why-no-developments-tp5776227p5776976.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fast insertion indexes: why no developments
On 5 November 2013 08:32, Leonardo Francalanci wrote: > Simon Riggs wrote >> Everybody on this thread is advised to look closely at Min Max indexes >> before starting any further work. >> >> MinMax will give us access to many new kinds of plan, plus they are >> about as close to perfectly efficient, by which I mean almost zero >> overhead, with regard to inserts as it is possible to get. > > Simon, I don't understand how minmax indexes would help in a random-inserts > scenario. > While I would love to use minmax for other columns (since we also partition > and search based on a timestamp, which is usually well clustered), I thought > minmax index would be perfect in a mostly-incremental values scenario. Minmax indexes seem to surprise many people, so broad generalisations aren't likely to be useful. I think the best thing to do is to publish some SQL requests that demonstrate in detail what you are trying to achieve and test them against minmax indexes. That way we can discuss what does work and what doesn't work well enough yet. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fast insertion indexes: why no developments
Simon Riggs wrote > Everybody on this thread is advised to look closely at Min Max indexes > before starting any further work. > > MinMax will give us access to many new kinds of plan, plus they are > about as close to perfectly efficient, by which I mean almost zero > overhead, with regard to inserts as it is possible to get. Simon, I don't understand how minmax indexes would help in a random-inserts scenario. While I would love to use minmax for other columns (since we also partition and search based on a timestamp, which is usually well clustered), I thought minmax index would be perfect in a mostly-incremental values scenario. -- View this message in context: http://postgresql.1045698.n5.nabble.com/Fast-insertion-indexes-why-no-developments-tp5776227p5776964.html Sent from the PostgreSQL - hackers mailing list archive at Nabble.com. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fast insertion indexes: why no developments
Andres Freund-3 wrote
> On 2013-11-04 11:27:33 -0500, Robert Haas wrote:
>> On Mon, Nov 4, 2013 at 11:24 AM, Claudio Freire <klaussfreire@...> wrote:
>>> Such a thing would help COPY, so maybe it's worth a look
>>
>> I have little doubt that a deferred insertion buffer of some kind
>> could help performance on some workloads, though I suspect the buffer
>> would have to be pretty big to make it worthwhile on a big COPY that
>> generates mostly-random insertions.
>
> Even for random data presorting the to-be-inserted data appropriately
> could result in much better io patterns.

Mmh, I'm afraid that the buffer would have to be huge to get any real advantage. You have to buffer enough values to avoid "touching" entire pages, which is not that easy. The LSM-tree is much more complicated than a simple memory buffer + delayed inserts.

The other index types that are supposed to help with the random-insertion workload rely on sequential insertions (at the expense of more writing, and slower reads) rather than re-using the btree pattern.

--
View this message in context: http://postgresql.1045698.n5.nabble.com/Fast-insertion-indexes-why-no-developments-tp5776227p5776963.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fast insertion indexes: why no developments
On 30 October 2013 14:34, Yann Fontana wrote:
>> On 30 October 2013 11:23, Leonardo Francalanci wrote:
>>
>>> In terms of generality, do you think it's worth a man year of developer
>>> effort to replicate what you have already achieved? Who would pay?
>
> I work on an application that does exactly what Leonardo described. We hit
> the exact same problem, and came up with the exact same solution (down
> to the 15 minutes interval). But I have also worked on various other
> datamarts (all using Oracle), and they are all subject to this problem in
> some form: B-tree indexes slow down bulk data inserts too much and need to
> be disabled or dropped and then recreated after the load. In some cases
> this is done easily enough, in others it's more complicated (example:
> every day, a process imports from 1 million to 1 billion records into a
> table partition that may contain from 0 to 1 billion records. To be as
> efficient as possible, you need some logic to compare the number of rows
> to insert to the number of rows already present, in order to decide
> whether to drop the indexes or not).
>
> Basically, my point is that this is a common problem for datawarehouses
> and datamarts. In my view, indexes that don't require developers to work
> around poor insert performance would be a significant feature in a
> "datawarehouse-ready" DBMS.

Everybody on this thread is advised to look closely at Min Max indexes before starting any further work.

MinMax will give us access to many new kinds of plan, plus they are about as close to perfectly efficient, by which I mean almost zero overhead, with regard to inserts as it is possible to get.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fast insertion indexes: why no developments
On 4 November 2013 19:55, Gavin Flower wrote: > How about having a 'TRANSIENT INDEX' that only exists in memory, so there is > no requirement to write it to disk or to replicate directly? This type of > index would be very fast and easier to implement. Recovery would involve > rebuilding the index, and sharing would involve recreating on a slave. > Probably not appropriate for a primary index, but may be okay for secondary > indexes used to speed specific queries. > > This could be useful in some situations now, and allow time to get > experience in how best to implement the basic concept. Then a more robust > solution using WAL etc can be developed later. > > I suspect that such a TRANSIENT INDEX would still be useful even when a more > robust in memory index method was available. As I expect it would be faster > to set up than a robust memory index - which might be good when you need to > have one or more indexes for a short period of time, or the size of the > index is so small that recreating it requires very little time (total > elapsed time might even be less than a disk backed one?). UNLOGGED indexes have been discussed over many years and there is pretty good acceptance of the idea for some use cases. The main tasks are * marking the index so they are unavailable on a hot standby * rebuilding the index in full at the end of recovery - requires an additional process to rebuild them possibly in priority order * make sure the above doesn't violate security * marking the index so it can't be used in the planner until rebuild is complete - subtle stuff -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fast insertion indexes: why no developments
On Mon, Nov 4, 2013 at 5:01 PM, Simon Riggs wrote: >> Of course, it's possible that even we do get a shared memory >> allocator, a hypothetical person working on this project might prefer >> to make the data block-structured anyway and steal storage from >> shared_buffers. So my aspirations in this area may not even be >> relevant. But I wanted to mention them, just in case anyone else is >> thinking about similar things, so that we can potentially coordinate. > > If anyone was going to work on LSM tree, I would advise building a > tree in shared/temp buffers first, then merging with the main tree. > The merge process could use the killed tuple approach to mark the > merging. > > The most difficult thing about buffering the inserts is deciding which > poor sucker gets the task of cleaning up. That's probably better as an > off-line process, which is where the work comes in. Non shared > buffered approaches would add too much overhead to the main task. Thing is, if you want crash safety guarantees, you cannot use temp (unlogged) buffers, and then you always have to flush to WAL at each commit. If the staging index is shared, then it could mean a lot of WAL (ie: probably around double the amount of WAL a regular b-tree would generate). Process-private staging trees that get merged on commit, ie: transaction-scope staging trees, on the other hand, do not require WAL logging, they can use temp buffers, and since they don't outlive the transaction, it's quite obvious who does the merging (the committer). Question is what kind of workload does that speed up with any significance and whether the amount of work is worth that speedup on those workloads. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fast insertion indexes: why no developments
On Mon, Nov 4, 2013 at 8:09 AM, Robert Haas wrote: > On Sat, Nov 2, 2013 at 6:07 AM, Simon Riggs wrote: > > On 29 October 2013 16:10, Peter Geoghegan wrote: > >> On Tue, Oct 29, 2013 at 7:53 AM, Leonardo Francalanci > wrote: > >>> I don't see much interest in insert-efficient indexes. > >> > >> Presumably someone will get around to implementing a btree index > >> insertion buffer one day. I think that would be a particularly > >> compelling optimization for us, because we could avoid ever inserting > >> index tuples that are already dead when the deferred insertion > >> actually occurs. > > > > That's pretty much what the LSM-tree is. > > What is pretty cool about this sort of thing is that there's no > intrinsic reason the insertion buffer needs to be block-structured or > disk-backed. How do we commit to not spilling to disk, in the face of an unbounded number of indexes existing and wanting to use this mechanism simultaneously? If it routinely needs to spill to disk, that would probably defeat the purpose of having it in the first place, but committing to never doing so seems to be extremely restrictive. As you say it is also freeing, in terms of using pointers and such, but I think the restrictions would outweigh the freedom. > In theory, you can structure the in-memory portion of > the tree any way you like, using pointers and arbitrary-size memory > allocations and all that fun stuff. You need to log that there's a > deferred insert (or commit to flushing the insertion buffer before > every commit, which would seem to miss the point) so that recovery can > reconstruct the in-memory data structure and flush it, but that's it: > the WAL format need not know any other details of the in-memory > portion of the tree. I think that, plus the ability to use pointers > and so forth, might lead to significant performance gains. > > In practice, the topology of our shared memory segment makes this a > bit tricky. The problem isn't so much that it's fixed size as that it > lacks a real allocator, and that all the space used for shared_buffers > is nailed down and can't be borrowed for other purposes. I think the fixed size is also a real problem, especially given the ubiquitous advice not to exceed 2 to 8 GB. Cheers, Jeff
Re: [HACKERS] Fast insertion indexes: why no developments
On 4 November 2013 16:09, Robert Haas wrote: > On Sat, Nov 2, 2013 at 6:07 AM, Simon Riggs wrote: >> On 29 October 2013 16:10, Peter Geoghegan wrote: >>> On Tue, Oct 29, 2013 at 7:53 AM, Leonardo Francalanci >>> wrote: I don't see much interest in insert-efficient indexes. >>> >>> Presumably someone will get around to implementing a btree index >>> insertion buffer one day. I think that would be a particularly >>> compelling optimization for us, because we could avoid ever inserting >>> index tuples that are already dead when the deferred insertion >>> actually occurs. >> >> That's pretty much what the LSM-tree is. > > What is pretty cool about this sort of thing is that there's no > intrinsic reason the insertion buffer needs to be block-structured or > disk-backed. In theory, you can structure the in-memory portion of > the tree any way you like, using pointers and arbitrary-size memory > allocations and all that fun stuff. You need to log that there's a > deferred insert (or commit to flushing the insertion buffer before > every commit, which would seem to miss the point) so that recovery can > reconstruct the in-memory data structure and flush it, but that's it: > the WAL format need not know any other details of the in-memory > portion of the tree. I think that, plus the ability to use pointers > and so forth, might lead to significant performance gains. Sounds like it could be cool, yes. > In practice, the topology of our shared memory segment makes this a > bit tricky. The problem isn't so much that it's fixed size as that it > lacks a real allocator, and that all the space used for shared_buffers > is nailed down and can't be borrowed for other purposes. I'm very > interested in solving the problem of getting a real allocator for > shared memory because I think I need it for parallel sort Agreed, you need a shared memory allocator for parallel query. It's just too tempting to build a hash table in shared memory on one thread, then use the hash table from multiple sessions as we do a parallel scan. Roughly same thing for parallel sort - passing pointers to complete data objects around is much easier than actually moving the data between processes, which would slow things down. We do also need generalised inter-node pipe but that's not the best solution to every problem. It's also a great way of controlling over-allocation of resources. > , and even if > there's a way to avoid needing it there I have a strong feeling that > we'll want it for other applications of dynamic shared memory. But it > should be possible to write the allocator in such a way that you just > give it a chunk of memory to manage, and it does, remaining agnostic > about whether the memory is from the main shared memory segment or a > dynamic one. Agreed > Of course, it's possible that even we do get a shared memory > allocator, a hypothetical person working on this project might prefer > to make the data block-structured anyway and steal storage from > shared_buffers. So my aspirations in this area may not even be > relevant. But I wanted to mention them, just in case anyone else is > thinking about similar things, so that we can potentially coordinate. If anyone was going to work on LSM tree, I would advise building a tree in shared/temp buffers first, then merging with the main tree. The merge process could use the killed tuple approach to mark the merging. The most difficult thing about buffering the inserts is deciding which poor sucker gets the task of cleaning up. 
That's probably better as an off-line process, which is where the work comes in. Non shared buffered approaches would add too much overhead to the main task. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fast insertion indexes: why no developments
On 05/11/13 05:35, Robert Haas wrote: On Mon, Nov 4, 2013 at 11:32 AM, Andres Freund wrote: I think doing this outside of s_b will make stuff rather hard for physical replication and crash recovery since we either will need to flush the whole buffer at checkpoints - which is hard since the checkpointer doesn't work inside individual databases - or we need to persist the in-memory buffer across restart which also sucks. You might be right, but I think part of the value of LSM-trees is that the in-memory portion of the data structure is supposed to be able to be optimized for in-memory storage rather than on disk storage. It may be that block-structuring that data bleeds away much of the performance benefit. Of course, I'm talking out of my rear end here: I don't really have a clue how these algorithms are supposed to work. How about having a 'TRANSIENT INDEX' that only exists in memory, so there is no requirement to write it to disk or to replicate directly? This type of index would be very fast and easier to implement. Recovery would involve rebuilding the index, and sharing would involve recreating on a slave. Probably not appropriate for a primary index, but may be okay for secondary indexes used to speed specific queries. This could be useful in some situations now, and allow time to get experience in how best to implement the basic concept. Then a more robust solution using WAL etc can be developed later. I suspect that such a TRANSIENT INDEX would still be useful even when a more robust in memory index method was available. As I expect it would be faster to set up than a robust memory index - which might be good when you need to have one or more indexes for a short period of time, or the size of the index is so small that recreating it requires very little time (total elapsed time might even be less than a disk backed one?). Cheers, Gavin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fast insertion indexes: why no developments
On Mon, Nov 4, 2013 at 11:32 AM, Andres Freund wrote: > I think doing this outside of s_b will make stuff rather hard for > physical replication and crash recovery since we either will need to > flush the whole buffer at checkpoints - which is hard since the > checkpointer doesn't work inside individual databases - or we need to > persist the in-memory buffer across restart which also sucks. You might be right, but I think part of the value of LSM-trees is that the in-memory portion of the data structure is supposed to be able to be optimized for in-memory storage rather than on disk storage. It may be that block-structuring that data bleeds away much of the performance benefit. Of course, I'm talking out of my rear end here: I don't really have a clue how these algorithms are supposed to work. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fast insertion indexes: why no developments
On 2013-11-04 11:27:33 -0500, Robert Haas wrote: > On Mon, Nov 4, 2013 at 11:24 AM, Claudio Freire > wrote: > > Such a thing would help COPY, so maybe it's worth a look > > I have little doubt that a deferred insertion buffer of some kind > could help performance on some workloads, though I suspect the buffer > would have to be pretty big to make it worthwhile on a big COPY that > generates mostly-random insertions. Even for random data presorting the to-be-inserted data appropriately could result in much better io patterns. > I think the question is not so > much whether it's worth doing but where anyone's going to find the > time to do it. Yea :( I think doing this outside of s_b will make stuff rather hard for physical replication and crash recovery since we either will need to flush the whole buffer at checkpoints - which is hard since the checkpointer doesn't work inside individual databases - or we need to persist the in-memory buffer across restart which also sucks. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
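That presorting idea can already be sketched manually (table and column names here are hypothetical): stage the batch unindexed, then apply it to the indexed table in index order, so that consecutive insertions land on neighbouring btree leaf pages.

create temp table calls_stage (like calls);
copy calls_stage from '/data/batch.csv' with (format csv);
insert into calls select * from calls_stage order by called_number;

-- note: this clusters the writes for one index only; with four
-- randomly-keyed indexes the other three still see random leaf touches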
Re: [HACKERS] Fast insertion indexes: why no developments
On Mon, Nov 4, 2013 at 11:31 AM, Claudio Freire wrote: > On Mon, Nov 4, 2013 at 1:27 PM, Robert Haas wrote: >> On Mon, Nov 4, 2013 at 11:24 AM, Claudio Freire >> wrote: >>> Such a thing would help COPY, so maybe it's worth a look >> >> I have little doubt that a deferred insertion buffer of some kind >> could help performance on some workloads, though I suspect the buffer >> would have to be pretty big to make it worthwhile on a big COPY that >> generates mostly-random insertions. I think the question is not so >> much whether it's worth doing but where anyone's going to find the >> time to do it. > > > However, since an admin can increase work_mem for that COPY, using > work_mem for this would be reasonable, don't you agree? Without implementing it and benchmarking the result, it's pretty hard to know. But if I were a betting man, I'd bet that's not nearly big enough. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fast insertion indexes: why no developments
On Mon, Nov 4, 2013 at 1:27 PM, Robert Haas wrote: > On Mon, Nov 4, 2013 at 11:24 AM, Claudio Freire > wrote: >> Such a thing would help COPY, so maybe it's worth a look > > I have little doubt that a deferred insertion buffer of some kind > could help performance on some workloads, though I suspect the buffer > would have to be pretty big to make it worthwhile on a big COPY that > generates mostly-random insertions. I think the question is not so > much whether it's worth doing but where anyone's going to find the > time to do it. However, since an admin can increase work_mem for that COPY, using work_mem for this would be reasonable, don't you agree? -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fast insertion indexes: why no developments
On Mon, Nov 4, 2013 at 11:24 AM, Claudio Freire wrote: > Such a thing would help COPY, so maybe it's worth a look I have little doubt that a deferred insertion buffer of some kind could help performance on some workloads, though I suspect the buffer would have to be pretty big to make it worthwhile on a big COPY that generates mostly-random insertions. I think the question is not so much whether it's worth doing but where anyone's going to find the time to do it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fast insertion indexes: why no developments
On Mon, Nov 4, 2013 at 1:09 PM, Robert Haas wrote: > On Sat, Nov 2, 2013 at 6:07 AM, Simon Riggs wrote: >> On 29 October 2013 16:10, Peter Geoghegan wrote: >>> On Tue, Oct 29, 2013 at 7:53 AM, Leonardo Francalanci >>> wrote: I don't see much interest in insert-efficient indexes. >>> >>> Presumably someone will get around to implementing a btree index >>> insertion buffer one day. I think that would be a particularly >>> compelling optimization for us, because we could avoid ever inserting >>> index tuples that are already dead when the deferred insertion >>> actually occurs. >> >> That's pretty much what the LSM-tree is. > > What is pretty cool about this sort of thing is that there's no > intrinsic reason the insertion buffer needs to be block-structured or > disk-backed. In theory, you can structure the in-memory portion of > the tree any way you like, using pointers and arbitrary-size memory > allocations and all that fun stuff. You need to log that there's a > deferred insert (or commit to flushing the insertion buffer before > every commit, which would seem to miss the point) Such a thing would help COPY, so maybe it's worth a look -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fast insertion indexes: why no developments
On Sat, Nov 2, 2013 at 6:07 AM, Simon Riggs wrote:
> On 29 October 2013 16:10, Peter Geoghegan wrote:
>> On Tue, Oct 29, 2013 at 7:53 AM, Leonardo Francalanci wrote:
>>> I don't see much interest in insert-efficient indexes.
>>
>> Presumably someone will get around to implementing a btree index
>> insertion buffer one day. I think that would be a particularly
>> compelling optimization for us, because we could avoid ever inserting
>> index tuples that are already dead when the deferred insertion
>> actually occurs.
>
> That's pretty much what the LSM-tree is.

What is pretty cool about this sort of thing is that there's no intrinsic reason the insertion buffer needs to be block-structured or disk-backed. In theory, you can structure the in-memory portion of the tree any way you like, using pointers and arbitrary-size memory allocations and all that fun stuff. You need to log that there's a deferred insert (or commit to flushing the insertion buffer before every commit, which would seem to miss the point) so that recovery can reconstruct the in-memory data structure and flush it, but that's it: the WAL format need not know any other details of the in-memory portion of the tree. I think that, plus the ability to use pointers and so forth, might lead to significant performance gains.

In practice, the topology of our shared memory segment makes this a bit tricky. The problem isn't so much that it's fixed size as that it lacks a real allocator, and that all the space used for shared_buffers is nailed down and can't be borrowed for other purposes. I'm very interested in solving the problem of getting a real allocator for shared memory because I think I need it for parallel sort, and even if there's a way to avoid needing it there I have a strong feeling that we'll want it for other applications of dynamic shared memory. But it should be possible to write the allocator in such a way that you just give it a chunk of memory to manage, and it does, remaining agnostic about whether the memory is from the main shared memory segment or a dynamic one.

Of course, it's possible that even if we do get a shared memory allocator, a hypothetical person working on this project might prefer to make the data block-structured anyway and steal storage from shared_buffers. So my aspirations in this area may not even be relevant. But I wanted to mention them, just in case anyone else is thinking about similar things, so that we can potentially coordinate.

--
Robert Haas EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fast insertion indexes: why no developments
On 29 October 2013 16:10, Peter Geoghegan wrote: > On Tue, Oct 29, 2013 at 7:53 AM, Leonardo Francalanci > wrote: >> I don't see much interest in insert-efficient indexes. > > Presumably someone will get around to implementing a btree index > insertion buffer one day. I think that would be a particularly > compelling optimization for us, because we could avoid ever inserting > index tuples that are already dead when the deferred insertion > actually occurs. That's pretty much what the LSM-tree is. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fast insertion indexes: why no developments
Gavin Flower-2 wrote > How about being able to mark indexes: > 'MEMORY ONLY' to make them not go to disk > and > 'PERSISTENT | TRANSIENT' to mark if they should be recreated on > machine bootup? I would love that. But: 1) I'd like to make some tests with a "memory drive", and confirm that in fact this would help (I'm sure I tried in the past, but I don't remember the outcome) 2) I don't know if the fact that they are in memory should be handled by the db or not. I was thinking about something more like "RECREATE IF NOT FOUND", that is: if the files aren't found at postgresql startup, re-create the index... 3) I don't know how many people would be interested (and how doable/complicated that would be, considering log-replay, replication etc etc) -- View this message in context: http://postgresql.1045698.n5.nabble.com/Fast-insertion-indexes-why-no-developments-tp5776227p5776471.html Sent from the PostgreSQL - hackers mailing list archive at Nabble.com. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
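Test 1) above can be sketched with a tablespace on a RAM-backed filesystem (assuming a tmpfs mounted at /mnt/ramdisk; all names are hypothetical). The caveat mentioned is real: if the machine crashes, the tablespace contents vanish and there is no automatic "recreate if not found":

create tablespace ram_space location '/mnt/ramdisk';
create index calls_called_idx on calls (called_number) tablespace ram_space;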
Re: [HACKERS] Fast insertion indexes: why no developments
Jeff Janes wrote
> True, but that is also true of indexes created in bulk. It all has to
> reach disk eventually--
> [...]
> If the checkpoint interval is as long as the partitioning period, then
> hopefully the active index buffers get re-dirtied while protected in
> shared_buffers, and only get written to disk once.

Honestly, I made a lot of tests in the past, and I don't remember if I tried 15-minute checkpoints + high shared_buffers. That might work. I'm going to try it and see what happens.

Jeff Janes wrote
> If the buffers get read, dirtied, and evicted from a small shared_buffers
> over and over again then you are almost guaranteed that they will get
> written to disk multiple times

(As I understand it, but I might be wrong:) high shared_buffers don't help, because with this kind of random index writing lots and lots of pages get dirtied, even if the change in each page was minimal. So, in the "15-minute" period, you write the same pages over and over again. Even if you have high shared_buffers, the same page will get synced to disk multiple times (at every checkpoint).

The idea of those "other" indexes is to avoid the random writing, maximizing sequential writing, even if that means writing more bytes. In other words: writing a full 8KB is no different from writing 20 bytes in a page, as we'll have to sync the whole page anyway...

I'll try a 15-minute checkpoint interval... and see what happens.

--
View this message in context: http://postgresql.1045698.n5.nabble.com/Fast-insertion-indexes-why-no-developments-tp5776227p5776470.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
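A sketch of the settings for that experiment (postgresql.conf values shown as comments; the sizes are assumptions for a 9.x-era server, not recommendations):

--   checkpoint_timeout  = 15min   -- align checkpoints with the partition window
--   checkpoint_segments = 256     -- keep xlog-driven checkpoints from firing early
--   shared_buffers      = 8GB     -- enough to hold the hot index pages for 15 minutes

select checkpoints_timed, checkpoints_req,
       buffers_checkpoint, buffers_backend
from pg_stat_bgwriter;

-- if checkpoints are all timed and buffers_backend stays low, pages are
-- being written once per cycle rather than over and over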
Re: [HACKERS] Fast insertion indexes: why no developments
On Wed, Oct 30, 2013 at 10:53 AM, Simon Riggs wrote: > LSM-tree also covers the goal of maintaining just 2 sub-trees and a > concurrent process of merging sub-trees. That sounds like it would > take a lot of additional time to get right and would need some > off-line process to perform the merge. > Not necessarily. Merging means applying insertions/deletions from one subtree to another. While it's normally preferable and more efficient to do it in batches, I've successfully implemented in-memory versions that use other writers to perform the task, amortizing the cost of merging across many operations. In essence, when there's a need to merge two subtrees, an inserting process also merges one entry, so slowly trees get merged. That works in-memory very well, it's quite clear that it's not necessarily generalizable to external storage, but it's a technique to have in mind. Alternatively, vacuum could do it. It's quite clearly a vacuuming task anyway.
Re: [HACKERS] Fast insertion indexes: why no developments
On 31/10/13 06:46, Jeff Janes wrote:
> On Wed, Oct 30, 2013 at 9:54 AM, Leonardo Francalanci <m_li...@yahoo.it> wrote:
>> Jeff Janes wrote
>>> The index insertions should be fast until the size of the active part of
>>> the indexes being inserted into exceeds shared_buffers by some amount
>>> (what that amount is would depend on how much dirty data the kernel is
>>> willing to allow in the page cache before it starts suffering anxiety
>>> about it). If you have enough shared_buffers to make that last for 15
>>> minutes, then you shouldn't have a problem inserting with live indexes.
>>
>> Sooner or later you'll have to checkpoint those shared_buffers...
>
> True, but that is also true of indexes created in bulk. It all has to
> reach disk eventually--either the checkpointer writes it out and fsyncs
> it, or the background writer or user backends write it out and the
> checkpointer fsyncs it.
>
> If bulk creation uses a ring buffer strategy (I don't know if it does),
> then it might kick the buffers to the kernel in more or less physical
> order, which would help the kernel get them to disk in long sequential
> writes. Or not. I think that this is where sorted checkpoints could
> really help.
>
>> and we are talking about GB of data (my understanding is that we change
>> basically every btree page, resulting in re-writing of the whole index).
>
> If the checkpoint interval is as long as the partitioning period, then
> hopefully the active index buffers get re-dirtied while protected in
> shared_buffers, and only get written to disk once. If the buffers get
> read, dirtied, and evicted from a small shared_buffers over and over
> again then you are almost guaranteed that they will get written to disk
> multiple times while they are still hot, unless your kernel is very
> aggressive about caching dirty data (which will cause other problems).
>
> Cheers, Jeff

How about being able to mark indexes:
'MEMORY ONLY' to make them not go to disk
and
'PERSISTENT | TRANSIENT' to mark if they should be recreated on machine bootup?

or something similar

Cheers, Gavin
Re: [HACKERS] Fast insertion indexes: why no developments
On Wed, Oct 30, 2013 at 9:54 AM, Leonardo Francalanci wrote:
> Jeff Janes wrote
>> The index insertions should be fast until the size of the active part of
>> the indexes being inserted into exceeds shared_buffers by some amount
>> (what that amount is would depend on how much dirty data the kernel is
>> willing to allow in the page cache before it starts suffering anxiety
>> about it). If you have enough shared_buffers to make that last for 15
>> minutes, then you shouldn't have a problem inserting with live indexes.
>
> Sooner or later you'll have to checkpoint those shared_buffers...

True, but that is also true of indexes created in bulk. It all has to reach disk eventually--either the checkpointer writes it out and fsyncs it, or the background writer or user backends write it out and the checkpointer fsyncs it.

If bulk creation uses a ring buffer strategy (I don't know if it does), then it might kick the buffers to the kernel in more or less physical order, which would help the kernel get them to disk in long sequential writes. Or not. I think that this is where sorted checkpoints could really help.

> and we are talking about GB of data (my understanding is that we change
> basically every btree page, resulting in re-writing of the whole index).

If the checkpoint interval is as long as the partitioning period, then hopefully the active index buffers get re-dirtied while protected in shared_buffers, and only get written to disk once. If the buffers get read, dirtied, and evicted from a small shared_buffers over and over again then you are almost guaranteed that they will get written to disk multiple times while they are still hot, unless your kernel is very aggressive about caching dirty data (which will cause other problems).

Cheers, Jeff
Re: [HACKERS] Fast insertion indexes: why no developments
Jeff Janes wrote
> You could periodically merge older partitions into larger tables, index
> those aggregated tables, then transactionally disinherit the old
> partitions and inherit the new aggregated one. This would keep the value
> of K down, at the expense of re-writing data multiple times (but all
> methods write data multiple times, some just hide it from you).

Forgot to add: I thought also that we could:

- use the RAM as a tablespace for indexes, and move the indexes later (but postgresql doesn't handle a machine crash very well in this case... it would be cool to be able to create an index as "recreate on crash"...)

- use unlogged tables and turn them to logged to speed up the insertion somehow; I actually started to write a patch for it, but making it work for replication was too hard (not that I'm using replication, but a patch that only worked for "wal_level = minimal" wouldn't be accepted). But this wouldn't solve the problem anyway.

--
View this message in context: http://postgresql.1045698.n5.nabble.com/Fast-insertion-indexes-why-no-developments-tp5776227p5776418.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
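The second idea later became possible without a custom patch: PostgreSQL 9.5 added ALTER TABLE ... SET LOGGED. A sketch of the load pattern (names are hypothetical; note that the conversion itself writes the whole table to WAL when wal_level is above minimal, which is exactly the objection raised above):

create unlogged table calls_batch (like calls);
copy calls_batch from '/data/batch.csv' with (format csv);
alter table calls_batch set logged;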
Re: [HACKERS] Fast insertion indexes: why no developments
Jeff Janes wrote
> Are partitions read-only once time has moved on, or can stragglers show up
> that need to be inserted into older partitions?
>
> You could periodically merge older partitions into larger tables, index
> those aggregated tables, then transactionally disinherit the old
> partitions and inherit the new aggregated one. This would keep the value
> of K down, at the expense of re-writing data multiple times (but all
> methods write data multiple times, some just hide it from you).

Yes, we could "merge" the partitions: the idea was to merge them during night hours, when traffic is low (and NSA people are sleeping ;) )

Jeff Janes wrote
> By the way, what is the transaction structure of your inserts? Are they
> large batches between commits, or is each row committed?

Of course large batches (using COPY)

--
View this message in context: http://postgresql.1045698.n5.nabble.com/Fast-insertion-indexes-why-no-developments-tp5776227p5776416.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fast insertion indexes: why no developments
> Point being: hardware is marching along pretty fast (after 20+ years > of stagnation) and it's dangerous (IMO) to make big software > investments based on the situation on the ground *today*. Yes, that's a good point. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fast insertion indexes: why no developments
Jeff Janes wrote > The index insertions should be fast until the size of the active part of > the indexes being inserted into exceeds shared_buffers by some amount > (what > that amount is would depend on how much dirty data the kernel is willing > to > allow in the page cache before it starts suffering anxiety about it). If > you have enough shared_buffers to make that last for 15 minutes, then you > shouldn't have a problem inserting with live indexes. Sooner or later you'll have to checkpoint those shared_buffers... and we are talking about GB of data (my understanding is that we change basically every btree page, resulting in re-writing of the whole index). -- View this message in context: http://postgresql.1045698.n5.nabble.com/Fast-insertion-indexes-why-no-developments-tp5776227p5776413.html Sent from the PostgreSQL - hackers mailing list archive at Nabble.com. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fast insertion indexes: why no developments
On Wed, Oct 30, 2013 at 4:23 AM, Leonardo Francalanci wrote:
> 1) I haven't achieved what I need: realtime indexing. I can't query the
> "current 15 minutes" table efficiently. Plus, K*log(N) is not that great
> when you have a lot of K.

Are partitions read-only once time has moved on, or can stragglers show up that need to be inserted into older partitions?

You could periodically merge older partitions into larger tables, index those aggregated tables, then transactionally disinherit the old partitions and inherit the new aggregated one. This would keep the value of K down, at the expense of re-writing data multiple times (but all methods write data multiple times, some just hide it from you).

> 2) I'm not suggesting that this is top priority. I'm asking if there's
> something else, other than "we don't have time for this", that I don't
> know. In fact, I don't even know if those index types would really help
> in my (specific) case. That's why my original question was "why aren't
> there developments in this area": I didn't mean to imply someone should
> do it. I just wanted to know if those indexes were already discussed (and
> maybe dismissed for some reason) in the past...

There have been several ideas discussed, but not a whole lot of time to implement them.

By the way, what is the transaction structure of your inserts? Are they large batches between commits, or is each row committed?

Cheers, Jeff
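That merge-and-swap can be sketched with the inheritance partitioning of the time (all names are hypothetical): fold two 15-minute partitions into one larger table, index it once in bulk, then swap it in transactionally so K drops and readers never see a gap.

create table calls_merged (like calls);
insert into calls_merged
  select * from calls_p0900 union all select * from calls_p0915;
create index on calls_merged (called_number);  -- built once, in bulk

begin;
alter table calls_p0900 no inherit calls;
alter table calls_p0915 no inherit calls;
alter table calls_merged inherit calls;
commit;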
Re: [HACKERS] Fast insertion indexes: why no developments
On Wed, Oct 30, 2013 at 3:35 AM, Leonardo Francalanci wrote:
>> Presumably the data you are inserting isn't actually random. Please
>> describe the use case you are considering in more detail and some view
>> on how frequent that is, with some examples. Once we understand the
>> use case and agree it is important, we might solve problems.
>
> Collecting calls data for mobile network operators (and no, I don't work
> for the NSA...)
> Easily 5000-10000 inserts per second. Indexes on timestamp and ID (not a
> problem, always increasing so no btree issues) and on called #, calling #,
> imsi, imei. The last four obviously are random, out of millions of
> possible values.
> After the few first millions of records, the disks can't keep up with the
> amount of random writing in the indexes.

So, like, 3 minutes worth? How much RAM and shared_buffers do you have?

The index insertions should be fast until the size of the active part of the indexes being inserted into exceeds shared_buffers by some amount (what that amount is would depend on how much dirty data the kernel is willing to allow in the page cache before it starts suffering anxiety about it). If you have enough shared_buffers to make that last for 15 minutes, then you shouldn't have a problem inserting with live indexes.

Cheers, Jeff
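That condition is easy to check on a live system (the index name is hypothetical): compare the size of the btree being inserted into against shared_buffers.

select pg_size_pretty(pg_relation_size('calls_called_idx')) as index_size,
       current_setting('shared_buffers') as shared_buffers;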
Re: [HACKERS] Fast insertion indexes: why no developments
On Wed, Oct 30, 2013 at 9:23 AM, Leonardo Francalanci wrote: >> LSM-trees seem patent free > > I'm no expert, and I gave it just a look some time ago: it looked to me very > complicated to get right... and as far as I remember you don't get that much > gain, unless you go multi-level which would complicate things further > >> Please somebody advise patent status of Y-trees otherwise I wouldn't bother. > > y-trees look much more easier to get right... (and to me they also make more > sense, but I'm not skilled enough to judge). > > There's also the FD-tree, which looks a lot like the (patented...) fractal > tree: > http://www.ntu.edu.sg/home/bshe/fdtree_pvldb.pdf Skimming the white paper, it's clear right from the start that they make assumptions about the hardware that appear to be already obsolete -- extremely poor write performance vs read performance of SSD. Later generation SSDs are increasingly using hardware optimizations to convert random writes to sequential writes using various tricks. Point being: hardware is marching along pretty fast (after 20+ years of stagnation) and it's dangerous (IMO) to make big software investments based on the situation on the ground *today*. merlin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fast insertion indexes: why no developments
On 30 October 2013 11:23, Leonardo Francalanci wrote:
>
>> In terms of generality, do you think it's worth a man year of developer
>> effort to replicate what you have already achieved? Who would pay?

I work on an application that does exactly what Leonardo described. We hit the exact same problem, and came up with the exact same solution (down to the 15 minutes interval). But I have also worked on various other datamarts (all using Oracle), and they are all subject to this problem in some form: B-tree indexes slow down bulk data inserts too much and need to be disabled or dropped and then recreated after the load. In some cases this is done easily enough, in others it's more complicated (example: every day, a process imports from 1 million to 1 billion records into a table partition that may contain from 0 to 1 billion records. To be as efficient as possible, you need some logic to compare the number of rows to insert to the number of rows already present, in order to decide whether to drop the indexes or not).

Basically, my point is that this is a common problem for datawarehouses and datamarts. In my view, indexes that don't require developers to work around poor insert performance would be a significant feature in a "datawarehouse-ready" DBMS.

Yann
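The "drop the indexes or not" decision described above might look like this in PostgreSQL (a sketch only; the names and the 25% threshold are arbitrary assumptions):

do $$
declare
  existing bigint;
begin
  select reltuples::bigint into existing
  from pg_class where relname = 'calls_p20131030';
  if (select count(*) from calls_stage) > existing / 4 then
    execute 'drop index if exists calls_p20131030_called_idx';
  end if;
end $$;

-- bulk load here, then recreate the index only if it was dropped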
Re: [HACKERS] Fast insertion indexes: why no developments
> LSM-trees seem patent free

I'm no expert, and I gave it just a quick look some time ago: it looked to me very complicated to get right... and as far as I remember you don't get that much gain, unless you go multi-level, which would complicate things further.

> Please somebody advise patent status of Y-trees otherwise I wouldn't bother.

Y-trees look much easier to get right... (and to me they also make more sense, but I'm not skilled enough to judge).

There's also the FD-tree, which looks a lot like the (patented...) fractal tree: http://www.ntu.edu.sg/home/bshe/fdtree_pvldb.pdf

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fast insertion indexes: why no developments
On 30 October 2013 11:23, Leonardo Francalanci wrote: >> What is the reason for needing such fast access to individual groups >> of records? Sure sounds like the NSA or similar ;-) > > > Users need to search all calls originated from/to a user or from/to a > specific mobile phone to answer/analyze customers' probl... ok, I give up: I > work for the NSA ;) > >> In terms of generality, do you think its worth a man year of developer >> effort to replicate what you have already achieved? Who would pay? > > > 1) I haven't achieved what I need: realtime indexing. I can't query the > "current 15 minutes" table efficiently. Plus, K*log(N) is not that great when > you have a lot of K. > 2) I'm not suggesting that this is top priority. I'm asking if there's > something else, other than "we don't have time for this", that I don't know. > In fact, I don't even know if those indexes types would really help in my > (specific) case. That's why my original question was "why aren't there > developments in this area": I didn't mean to imply someone should do it. I > just wanted to know if those indexes were already discussed (and maybe > dismissed for some reason) in the past... OK, I understand now. LSM-trees seem patent free, since open source implementations exist. The main concept is to partition the index into multiple trees, so that the current insertion sub-tree fits more easily in memory. That sounds good and was exactly the solution I'd come up with as well, which is a good cross check. It leads to a slow increase in index response times, but we could offset that by having minimum values on each subtree and using partitioning logic as with a minmax index. LSM-tree also covers the goal of maintaining just 2 sub-trees and a concurrent process of merging sub-trees. That sounds like it would take a lot of additional time to get right and would need some off-line process to perform the merge. Please somebody advise patent status of Y-trees otherwise I wouldn't bother. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
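The thread's partitioning workaround is, in effect, a manual version of that idea (names are hypothetical): the "current insertion sub-tree" is simply the newest 15-minute partition's btree, small enough to stay in memory while it is being written, with the older sub-trees sitting read-only behind it.

create table calls_p1015 (
  check (ts >= '2013-10-30 10:15' and ts < '2013-10-30 10:30')
) inherits (calls);
create index on calls_p1015 (called_number);

-- inserts target only calls_p1015 during its window; once the window
-- closes, its btree is never written again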
Re: [HACKERS] Fast insertion indexes: why no developments
On 2013-10-30 12:08, Simon Riggs wrote:
> In terms of generality, do you think it's worth a man-year of developer
> effort to replicate what you have already achieved? Who would pay?

The NSA, obviously ;-)

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fast insertion indexes: why no developments
> What is the reason for needing such fast access to individual groups
> of records? Sure sounds like the NSA or similar ;-)

Users need to search all calls originated from/to a user or from/to a specific mobile phone to answer/analyze customers' probl... ok, I give up: I work for the NSA ;)

> In terms of generality, do you think it's worth a man-year of developer
> effort to replicate what you have already achieved? Who would pay?

1) I haven't achieved what I need: real-time indexing. I can't query the "current 15 minutes" table efficiently. Plus, K*log(N) is not that great when K is large.
2) I'm not suggesting that this is top priority. I'm asking if there's something else, other than "we don't have time for this", that I don't know. In fact, I don't even know if those index types would really help in my (specific) case. That's why my original question was "why aren't there developments in this area": I didn't mean to imply someone should do it. I just wanted to know if those indexes were already discussed (and maybe dismissed for some reason) in the past...

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fast insertion indexes: why no developments
On 30 October 2013 10:35, Leonardo Francalanci wrote:
>> Presumably the data you are inserting isn't actually random. Please
>> describe the use case you are considering in more detail, with some view
>> on how frequent it is, and with some examples. Once we understand the
>> use case and agree it is important, we might solve problems.
>
> Collecting call data for mobile network operators (and no, I don't work for
> the NSA...)
> Easily 5000-10000 inserts per second. Indexes on timestamp and ID (not a
> problem: always increasing, so no btree issues) and on called #, calling #,
> imsi, and imei. The last four obviously are random, out of millions of
> possible values.
> After the first few million records, the disks can't keep up with the
> amount of random writing in the indexes. Workaround: the table is partitioned
> every 15 minutes, and indexes are created in bulk after we "start" the new
> 15-minute partition. Searches on the current 15 minutes are not allowed (as
> that partition is not indexed), and searches on older data are K*log(N)
> (where K is the number of partitions).
> Yes, I could throw more disks at it, use SSDs, shard more, etc. etc. But I
> still think that btrees just aren't fit for this kind of problem. I don't
> delete data, I don't update data, and there's not that much concurrency
> going on. I would sacrifice search speed (K*log(N) is already much slower
> than "regular" btree usage) for real-time insertion.
>
> I don't think I'm the only one with a big system that needs to be indexed
> by random values.
>
> In fact, I didn't want to turn this thread into a "help me with this
> workload" thread. I just wanted to know if there are other known issues
> with these "different" indexes, besides "not enough time to implement them
> correctly": I was afraid that someone had already dismissed them as "good
> in theory, bad in practice"...

What is the reason for needing such fast access to individual groups of records? Sure sounds like the NSA or similar ;-)

Sacrificing timeliness for efficiency is a common solution. I'm seeing lots of areas where being able to specify, in a query, how much staleness is acceptable would lead to optimisations of this kind and others. Indexes are a declarative solution; we would need to be able to specify the tolerances to be able to do this. (You can write your own index...)

In terms of generality, do you think it's worth a man-year of developer effort to replicate what you have already achieved? Who would pay?

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fast insertion indexes: why no developments
> Presumably the data you are inserting isn't actually random. Please
> describe the use case you are considering in more detail, with some view
> on how frequent it is, and with some examples. Once we understand the
> use case and agree it is important, we might solve problems.

Collecting call data for mobile network operators (and no, I don't work for the NSA...)

Easily 5000-10000 inserts per second. Indexes on timestamp and ID (not a problem: always increasing, so no btree issues) and on called #, calling #, imsi, and imei. The last four obviously are random, out of millions of possible values.

After the first few million records, the disks can't keep up with the amount of random writing in the indexes. Workaround: the table is partitioned every 15 minutes, and indexes are created in bulk after we "start" the new 15-minute partition. Searches on the current 15 minutes are not allowed (as that partition is not indexed), and searches on older data are K*log(N) (where K is the number of partitions).

Yes, I could throw more disks at it, use SSDs, shard more, etc. etc. But I still think that btrees just aren't fit for this kind of problem. I don't delete data, I don't update data, and there's not that much concurrency going on. I would sacrifice search speed (K*log(N) is already much slower than "regular" btree usage) for real-time insertion.

I don't think I'm the only one with a big system that needs to be indexed by random values.

In fact, I didn't want to turn this thread into a "help me with this workload" thread. I just wanted to know if there are other known issues with these "different" indexes, besides "not enough time to implement them correctly": I was afraid that someone had already dismissed them as "good in theory, bad in practice"...

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
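A toy model of that workaround (Python; an invented structure, not the actual schema) showing where the K*log(N) comes from:

import bisect

class PartitionedStore:
    def __init__(self):
        self.closed = []     # indexed partitions: one sorted key list each
        self.current = []    # current 15-minute partition: unindexed, append-only

    def insert(self, key):
        self.current.append(key)   # sequential write, no index maintenance

    def rotate(self):
        """Close the current partition and build its index in bulk."""
        self.closed.append(sorted(self.current))
        self.current = []

    def search(self, key):
        # The current partition is not searchable, matching the workaround:
        # "searches on the current 15 minutes are not allowed".
        hits = 0
        for part in self.closed:                 # K partitions ...
            i = bisect.bisect_left(part, key)    # ... log(N) probe in each
            while i < len(part) and part[i] == key:
                hits += 1
                i += 1
        return hits

store = PartitionedStore()
for k in [5, 1, 5, 9]:
    store.insert(k)
store.rotate()
store.insert(5)
print(store.search(5))   # 2: the not-yet-rotated insert is not visible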
Re: [HACKERS] Fast insertion indexes: why no developments
On 30 October 2013 07:55, Leonardo Francalanci wrote:
>> Hmm, you realise Alvaro is working on MinMax indexes in this release?
>> They are very efficient with regard to index inserts and specially
>> designed for use on large tables.
>>
>> Prior work by Heikki on Grouped Item Tuples was a way of reducing the
>> size of indexes, yet still allowing uniqueness checks. That is
>> implemented in SQLServer already and is very useful.
>
> Reading the implementation of those features, I don't think they can help in
> the cases handled by the index types I mentioned (insertions of random values
> in big tables).

Presumably the data you are inserting isn't actually random. Please describe the use case you are considering in more detail, with some view on how frequent it is, and with some examples. Once we understand the use case and agree it is important, we might solve problems.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fast insertion indexes: why no developments
> Hmm, you realise Alvaro is working on MinMax indexes in this release?
> They are very efficient with regard to index inserts and specially
> designed for use on large tables.
>
> Prior work by Heikki on Grouped Item Tuples was a way of reducing the
> size of indexes, yet still allowing uniqueness checks. That is
> implemented in SQLServer already and is very useful.

Reading the implementation of those features, I don't think they can help in the cases handled by the index types I mentioned (insertions of random values in big tables).

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fast insertion indexes: why no developments
> Hmm, you realise Alvaro is working on MinMax indexes in this release?
> They are very efficient with regard to index inserts and specially
> designed for use on large tables.
>
> Prior work by Heikki on Grouped Item Tuples was a way of reducing the
> size of indexes, yet still allowing uniqueness checks. That is
> implemented in SQLServer already and is very useful.

Ah! Didn't know that!

> Your comment about the lack of development in indexes seems counter to
> the literature that I've seen. The main problem is people keep
> patenting things, making it fairly difficult for everyone.

Mmh, maybe I wasn't clear: I meant the lack of development (maybe I should have said "implementation"?) in PostgreSQL and in the other "SQL databases" of the fast-insertion indexes you can find in the literature.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fast insertion indexes: why no developments
On 29 October 2013 07:53, Leonardo Francalanci wrote:
> I don't see much interest in insert-efficient indexes.

Hmm, you realise Alvaro is working on MinMax indexes in this release? They are very efficient with regard to index inserts and specially designed for use on large tables.

Prior work by Heikki on Grouped Item Tuples was a way of reducing the size of indexes, yet still allowing uniqueness checks. That is implemented in SQLServer already and is very useful.

Your comment about the lack of development in indexes seems counter to the literature that I've seen. The main problem is people keep patenting things, making it fairly difficult for everyone.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fast insertion indexes: why no developments
Jeff Janes writes:
> Robert removed the lmgr lock on the meta page by using a retry loop with
> lightweight locks. I've outlined how to remove the heavyweight lock on the
> bucket page as well, but it would require an on-disk change to the index so
> that each page knows how far the bucket it is in has been split, and it
> also might abuse the intention of lightweight locks a bit.

FWIW, I don't think that on-disk changes to hash indexes would be a showstopper problem at this point. We could force people to reindex them by means of changing the index version number on the metapage. The reindex downtime would be annoying for production situations --- but given the lack of WAL support, who'd be using one in production anyway?

> But I'm reluctant to put much time into that without there being any
> prospects of solving the problem of how to WAL bucket splits when buckets
> can have an unbounded number of overflow pages.

Agreed, if we don't see how to implement WAL logging then it's improbable they'll ever get to production quality :-(.

ISTM the issue here is that we'd need to acknowledge incompletely-split buckets as a valid state, no? But that could be a good thing anyway, if it'd mean that we don't have to completely lock a bucket while splitting it. All the other index types have comparable situations where a maintenance operation might be only partly done.

Not that I'm volunteering to put any time into this myself.

			regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fast insertion indexes: why no developments
On Tue, Oct 29, 2013 at 8:16 AM, Tom Lane wrote:
> Leonardo Francalanci writes:
>>> Before getting too excited about some new academic index type, it's worth
>>> noting the sad state in which hash indexes have languished for years.
>
>> Aren't hash indexes in a poor state because they are not faster than
>> btree in every condition?
>
> They should, in theory, be faster than btrees -- O(1) not O(log N) page
> fetches per lookup.

However, all but one or two of those page fetches are almost surely cached, so if the problem is IO then the benefits are not likely to be seen.

> In practice they don't seem to be faster, and
> nobody's bothered to find out exactly why.

We know why, more or less. Hash indexes use lmgr locks to protect against bucket splits conflicting with ordinary operations, and that destroys performance even in isolation, and destroys it even more in concurrent situations.

Robert removed the lmgr lock on the meta page by using a retry loop with lightweight locks. I've outlined how to remove the heavyweight lock on the bucket page as well, but it would require an on-disk change to the index so that each page knows how far the bucket it is in has been split, and it also might abuse the intention of lightweight locks a bit. But I'm reluctant to put much time into that without there being any prospects of solving the problem of how to WAL bucket splits when buckets can have an unbounded number of overflow pages.

(Once each page knows its own split level, we could also remove the need for even a light-weight lock on the metapage for most operations by stuffing some of the key info from that into the relcache.)

Cheers,

Jeff
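For readers unfamiliar with the bucket-split mechanics, here is a toy linear-hashing sketch in Python (purely illustrative, not PostgreSQL's hash AM code; all names invented) where each bucket records its own split level, which is the on-disk change being proposed:

class Bucket:
    def __init__(self, level):
        self.level = level   # how far this bucket has been split
        self.keys = []

class ToyLinearHash:
    SPLIT_THRESHOLD = 4

    def __init__(self):
        self.level = 0          # global round: at least 2**level buckets exist
        self.next_split = 0     # next bucket due to be split
        self.buckets = [Bucket(level=0)]

    def _addr(self, h):
        i = h % (1 << self.level)
        if i < self.next_split:              # that bucket has already split
            i = h % (1 << (self.level + 1))
        return i

    def insert(self, key):
        b = self.buckets[self._addr(hash(key))]
        b.keys.append(key)
        if len(b.keys) > self.SPLIT_THRESHOLD:
            # Classic linear hashing: the bucket that splits is the one at
            # next_split, not necessarily the one that just overflowed.
            self._split_one()

    def _split_one(self):
        # Split bucket `next_split`, bumping the level recorded on it; a
        # reader that cached the old level could detect the split and retry
        # instead of holding a heavyweight lock across the whole operation.
        old = self.buckets[self.next_split]
        old.level += 1
        new = Bucket(level=old.level)
        self.buckets.append(new)   # sibling index = next_split + 2**(old.level-1)
        stay, move = [], []
        for k in old.keys:
            (stay if hash(k) % (1 << old.level) == self.next_split else move).append(k)
        old.keys, new.keys = stay, move
        self.next_split += 1
        if self.next_split == 1 << self.level:   # round complete
            self.level += 1
            self.next_split = 0

    def lookup(self, key):
        return key in self.buckets[self._addr(hash(key))].keys

t = ToyLinearHash()
for k in range(20):
    t.insert(k)
print(all(t.lookup(k) for k in range(20)), t.lookup(99))   # True False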
Re: [HACKERS] Fast insertion indexes: why no developments
> I bet you've mis-diagnosed the problem. Btrees don't have a problem
> keeping up with 50M records; your problem is that after a certain
> point your page cache can't keep up with the pseudo-random I/O
> patterns and you start seeing faults to storage.
> [...] This has nothing to do with the btree algorithm except to the
> extent it affects I/O patterns.

Of course; that's why those "different" index types aim to use more sequential than random writes.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fast insertion indexes: why no developments
On Tue, Oct 29, 2013 at 7:53 AM, Leonardo Francalanci wrote:
> I don't see much interest in insert-efficient indexes.

Presumably someone will get around to implementing a btree index insertion buffer one day. I think that would be a particularly compelling optimization for us, because we could avoid ever inserting index tuples that are already dead when the deferred insertion actually occurs.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fast insertion indexes: why no developments
On Tue, Oct 29, 2013 at 10:49 AM, Leonardo Francalanci wrote:
>> Another point to add: I don't really see btree as a barrier to
>> performance for most of the problems I face. The real barriers to
>> database performance are storage, contention, and query planning.
>
> Ehm, that's true for regular OLTP stuff, which I understand is what most
> (95%?) of people use/need. But if you try to insert rows into a 50M-row
> table with a couple of indexes, btrees just can't keep up.
> Of course, you can't have it all: fast big-table insertion, good
> contention, good query times...
>
>> Postgres btree indexes are pretty fast, and for stuff like bulk
>> insertions there are some optimization techniques available (such as
>> sharding or create index concurrently).
>
> At the moment I'm relying on partitioning + creating indexes in bulk on the
> "latest" table (the partitioning is based on time). But that means K*log(N)
> search times (where K is the number of partitions).
> That's why I gave a look at these different indexing mechanisms.

I bet you've mis-diagnosed the problem. Btrees don't have a problem keeping up with 50M records; your problem is that after a certain point your page cache can't keep up with the pseudo-random I/O patterns and you start seeing faults to storage. Disk storage is several orders of magnitude slower than memory, and thus performance collapses. This has nothing to do with the btree algorithm except to the extent it affects I/O patterns.

With the advances in storage over the last several years, such that commodity-priced SSD is available, I think a lot of the assumptions underlying these trade-offs will change.

merlin

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
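Some back-of-envelope arithmetic supporting that diagnosis (all numbers are assumptions for illustration, not measurements):

inserts_per_sec = 5000     # workload described upthread (assumed)
random_indexes  = 4        # called #, calling #, imsi, imei
hdd_iops        = 200      # one 7200rpm spindle, roughly
ssd_iops        = 50_000   # commodity SSD, roughly

# Once the indexes are much larger than RAM, each insert dirties about
# one uncached leaf page per random-keyed index.
leaf_touches = inserts_per_sec * random_indexes

print(f"random leaf-page touches/sec: {leaf_touches}")     # 20000
print(f"spindles needed: ~{leaf_touches / hdd_iops:.0f}")  # ~100
print(f"SSDs needed:     ~{leaf_touches / ssd_iops:.1f}")  # ~0.4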
Re: [HACKERS] Fast insertion indexes: why no developments
On Tue, Oct 29, 2013 at 1:10 PM, Peter Geoghegan wrote:
> On Tue, Oct 29, 2013 at 7:53 AM, Leonardo Francalanci wrote:
>> I don't see much interest in insert-efficient indexes.
>
> Presumably someone will get around to implementing a btree index
> insertion buffer one day. I think that would be a particularly
> compelling optimization for us, because we could avoid ever inserting
> index tuples that are already dead when the deferred insertion
> actually occurs.

Well, that should be relatively easy to do the way web mining does it (with inverted indexes).

Have a small (presumably RAM-fitting) staging index where inserts take place, and regularly dump it into the "master" index. The rationale is that, by the time you dump it, it will be more efficient to do many inserts at once, and there will be lots of dead tuples you don't even have to consider.

And when I say relatively easy, I mean it in the sense that it only needs careful juggling of existing data structures.
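A minimal sketch of that staging/merge scheme (Python; invented names and structure), including the filtering of already-dead tuples at merge time:

import bisect

class BufferedIndex:
    def __init__(self, flush_at=1000):
        self.staging = {}    # tid -> key; small, RAM-resident
        self.master = []     # sorted (key, tid); stands in for the big btree
        self.flush_at = flush_at

    def insert(self, tid, key):
        self.staging[tid] = key
        if len(self.staging) >= self.flush_at:
            # A real system would pass a visibility check here instead of
            # this stub, so tuples that died while buffered are skipped.
            self.merge(is_dead=lambda tid: False)

    def merge(self, is_dead):
        # One batch of ordered inserts is far cheaper than many random
        # ones, and dead tuples are dropped before they cost anything.
        batch = sorted((key, tid) for tid, key in self.staging.items()
                       if not is_dead(tid))
        for item in batch:
            bisect.insort(self.master, item)   # near-sequential in key order
        self.staging.clear()

    def search(self, key):
        tids = [t for t, k in self.staging.items() if k == key]
        i = bisect.bisect_left(self.master, (key,))
        while i < len(self.master) and self.master[i][0] == key:
            tids.append(self.master[i][1])
            i += 1
        return tids

idx = BufferedIndex(flush_at=4)
for tid, key in enumerate([10, 20, 10, 30]):
    idx.insert(tid, key)   # the 4th insert triggers a merge
print(idx.search(10))      # [0, 2]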
Re: [HACKERS] Fast insertion indexes: why no developments
> They should, in theory, be faster than btrees -- O(1) not O(log N) page
> fetches per lookup. In practice they don't seem to be faster, and
> nobody's bothered to find out exactly why. Again, this isn't a terribly
> encouraging precedent for implementing some other index type that's
> supposed to (sometimes) be faster than btrees.

Yes, I understand. Which is also why I was curious to know whether the "claims" those papers (and the databases using them) make are real...

Thank you everybody for your replies.

Leonardo

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fast insertion indexes: why no developments
> Another point to add: I don't really see btree as a barrier to
> performance for most of the problems I face. The real barriers to
> database performance are storage, contention, and query planning.

Ehm, that's true for regular OLTP stuff, which I understand is what most (95%?) of people use/need. But if you try to insert rows into a 50M-row table with a couple of indexes, btrees just can't keep up. Of course, you can't have it all: fast big-table insertion, good contention, good query times...

> Postgres btree indexes are pretty fast, and for stuff like bulk
> insertions there are some optimization techniques available (such as
> sharding or create index concurrently).

At the moment I'm relying on partitioning + creating indexes in bulk on the "latest" table (the partitioning is based on time). But that means K*log(N) search times (where K is the number of partitions). That's why I gave a look at these different indexing mechanisms.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fast insertion indexes: why no developments
Leonardo Francalanci writes:
>> Before getting too excited about some new academic index type, it's worth
>> noting the sad state in which hash indexes have languished for years.
>
> Aren't hash indexes in a poor state because they are not faster than btree in
> every condition?

They should, in theory, be faster than btrees -- O(1) not O(log N) page fetches per lookup. In practice they don't seem to be faster, and nobody's bothered to find out exactly why. Again, this isn't a terribly encouraging precedent for implementing some other index type that's supposed to (sometimes) be faster than btrees.

None of this is meant to discourage you from trying to write an index type if you have the time and motivation to pursue it. Just trying to answer your question as to why nobody's done it already.

			regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
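The page-fetch arithmetic behind that, with assumed but typical numbers:

import math

# Rough arithmetic (assumptions, not measurements) behind "O(1) vs
# O(log N) page fetches": a btree's log N is small, and its upper
# levels are almost always cached, so both designs converge to about
# one uncached page fetch per lookup.

rows   = 1_000_000_000
fanout = 400   # keys per 8KB internal page, roughly
levels = math.ceil(math.log(rows, fanout))
print(f"btree levels: {levels}")   # 4: root + 2 internal + leaf

# The root and internal levels are a tiny fraction of the index, so
# they stay cached; effectively ~1 uncached fetch (the leaf) per
# lookup. A hash lookup is ~1 bucket-page fetch -- about the same once
# cached pages are free. Hence "in theory faster" but not in practice
# for I/O-bound lookups.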
Re: [HACKERS] Fast insertion indexes: why no developments
On Tue, Oct 29, 2013 at 02:53:37PM +0000, Leonardo Francalanci wrote:
>> Before getting too excited about some new academic index type, it's worth
>> noting the sad state in which hash indexes have languished for years.
>> Nobody's bothered to add WAL support, let alone do any other real work
>> on them. The non-btree index types that have been getting love are the
>> ones that offer the ability to index queries that btree can't. I think
>> a new index type whose only benefit is the claim to be faster in a narrow
>> use-case is likely to end up like hash, not getting used enough to be
>> properly maintained.
>>
>> 			regards, tom lane
>
> Aren't hash indexes in a poor state because they are not faster than btree in
> every condition?

Hi Leonardo,

If there were ONE perfect index, better in every condition, Postgres would be using it. As in everything else, each type has its strengths and weaknesses. The hash index allows equality searches for very large key lengths using a relatively very small index size. As has been mentioned before, we still do not have WAL logging for hash indexes. But even so, for I/O-bound systems hash indexes are twice as fast for searches as the btree equivalent.

Regards,
Ken

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fast insertion indexes: why no developments
Leonardo Francalanci wrote:
>> Before getting too excited about some new academic index type, it's worth
>> noting the sad state in which hash indexes have languished for years.
>> Nobody's bothered to add WAL support, let alone do any other real work
>> on them. The non-btree index types that have been getting love are the
>> ones that offer the ability to index queries that btree can't. I think
>> a new index type whose only benefit is the claim to be faster in a narrow
>> use-case is likely to end up like hash, not getting used enough to be
>> properly maintained.
>
> Aren't hash indexes in a poor state because they are not faster than
> btree in every condition?

Chicken and egg. Maybe they can be made faster than btrees (in some situations) with enough tweaks, but because there are so many outstanding problems, no one wants to do the huge amount of legwork to even get to the point where such tweaks can be made in the first place.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fast insertion indexes: why no developments
> Before getting too excited about some new academic index type, it's worth
> noting the sad state in which hash indexes have languished for years.
> Nobody's bothered to add WAL support, let alone do any other real work
> on them. The non-btree index types that have been getting love are the
> ones that offer the ability to index queries that btree can't. I think
> a new index type whose only benefit is the claim to be faster in a narrow
> use-case is likely to end up like hash, not getting used enough to be
> properly maintained.
>
> 			regards, tom lane

Aren't hash indexes in a poor state because they are not faster than btree in every condition?

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fast insertion indexes: why no developments
On Tue, Oct 29, 2013 at 2:53 AM, Leonardo Francalanci wrote:
> Hi,
>
> I don't see much interest in insert-efficient indexes. These are the ones
> I've found:
>
> - LSM-tree (used by Cassandra and SQLite4?)
> - Y-tree
>   (http://www.bossconsulting.com/oracle_dba/white_papers/DW%20in%20oracle/P23%20(ytree%20index%20structure%20for%20DWs).pdf )
> - Fractal tree indexes (TokuDB, patented)
>
> While I understand that b*trees are still the best compromise in
> insertion/search speed, disk size, concurrency, and OLTP workloads more
> generally, they are useless when it comes to inserting random (not ordered)
> values into big tables (>50M rows).
>
> I would like to know if the lack of development in this area (not only in
> PostgreSQL, but in databases in general) is due to:
>
> 1) complex implementation
> 2) poor search performance
> 3) poor concurrency performance
> 4) not being interesting for most users
> 5) something else???
>
> I thought this was going to change due to the fast-insertion needs of
> "social applications", but TokuDB seems to be the only "successful" player
> in the area (I don't know how much of that is due to good marketing). Most
> other DB technologies claim faster insertion speed (MongoDB and the like...)
> but in the end they rely on the old b*tree + sharding instead of using
> different indexing mechanisms (with the exception of Cassandra).

Another point to add: I don't really see btree as a barrier to performance for most of the problems I face. The real barriers to database performance are storage, contention, and query planning. Postgres btree indexes are pretty fast, and for stuff like bulk insertions there are some optimization techniques available (such as sharding or create index concurrently).

Stuff I'd like to see in terms of postgres indexing:

*) a faster, WAL-logged hash index
*) composite gist/gin
*) faster gist/gin (to the extent that it's possible)

merlin

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fast insertion indexes: why no developments
Craig Ringer writes:
> On 10/29/2013 03:53 PM, Leonardo Francalanci wrote:
>> 5) something else???
>
> Quite likely nobody has had the enthusiasm and interest to produce a
> viable, quality implementation and stick with it long enough to get it
> committed.
>
> There are a great many good ideas for improvements to Pg that just don't
> have the people and time behind them to make them happen.

Before getting too excited about some new academic index type, it's worth noting the sad state in which hash indexes have languished for years. Nobody's bothered to add WAL support, let alone do any other real work on them. The non-btree index types that have been getting love are the ones that offer the ability to index queries that btree can't. I think a new index type whose only benefit is the claim to be faster in a narrow use-case is likely to end up like hash, not getting used enough to be properly maintained.

			regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fast insertion indexes: why no developments
On 10/29/2013 03:53 PM, Leonardo Francalanci wrote:
> 5) something else???

Quite likely nobody has had the enthusiasm and interest to produce a viable, quality implementation and stick with it long enough to get it committed.

There are a great many good ideas for improvements to Pg that just don't have the people and time behind them to make them happen.

--
Craig Ringer                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Fast insertion indexes: why no developments
Hi,

I don't see much interest in insert-efficient indexes. These are the ones I've found:

- LSM-tree (used by Cassandra and SQLite4?)
- Y-tree (http://www.bossconsulting.com/oracle_dba/white_papers/DW%20in%20oracle/P23%20(ytree%20index%20structure%20for%20DWs).pdf )
- Fractal tree indexes (TokuDB, patented)

While I understand that b*trees are still the best compromise in insertion/search speed, disk size, concurrency, and OLTP workloads more generally, they are useless when it comes to inserting random (not ordered) values into big tables (>50M rows).

I would like to know if the lack of development in this area (not only in PostgreSQL, but in databases in general) is due to:

1) complex implementation
2) poor search performance
3) poor concurrency performance
4) not being interesting for most users
5) something else???

I thought this was going to change due to the fast-insertion needs of "social applications", but TokuDB seems to be the only "successful" player in the area (I don't know how much of that is due to good marketing). Most other DB technologies claim faster insertion speed (MongoDB and the like...) but in the end they rely on the old b*tree + sharding instead of using different indexing mechanisms (with the exception of Cassandra).

Thank you in advance

Leonardo

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers