Re: [HACKERS] multivariate statistics / patch v7
Tomas,

> attached is v7 of the multivariate stats patch. The main improvement is
> major refactoring of the clausesel.c portion - splitting the awfully
> long spaghetti-style functions into smaller pieces, making it much more
> understandable etc.

So presumably v7 handles varlena attributes as well, yes? I have a destruction test case for correlated column stats, so I'd like to test your patch on it.

-- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] multivariate statistics / patch v7
Hi,

On 09/24/2015 06:43 PM, Josh Berkus wrote:
> Tomas,
>> attached is v7 of the multivariate stats patch. The main improvement
>> is major refactoring of the clausesel.c portion - splitting the
>> awfully long spaghetti-style functions into smaller pieces, making it
>> much more understandable etc.
>
> So presumably v7 handles varlena attributes as well, yes? I have a
> destruction test case for correlated column stats, so I'd like to test
> your patch on it.

Yes, it should handle varlena OK. Let me know if you need help with that, and I'd like to hear feedback - whether it fixed your test case or not, etc.

regards

-- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] multivariate statistics / patch v7
On Fri, Jul 31, 2015 at 6:28 AM, Tomas Vondra tomas.von...@2ndquadrant.com wrote:
> [series of arguments] If you need stats without these issues you'll
> have to use an MCV list or a histogram. Trying to fix the simple
> statistics types is futile, IMHO.

Patch is marked as returned with feedback. There have been advanced discussions and reviews as well.

-- Michael
Re: [HACKERS] multivariate statistics / patch v7
Hi,

On 07/30/2015 10:21 AM, Heikki Linnakangas wrote:
> On 05/25/2015 11:43 PM, Tomas Vondra wrote:
>> There are 6 files attached, but only 0002-0006 are actually part of
>> the multivariate statistics patch itself.
>
> All of these patches are huge. In order to review this in a reasonable
> amount of time, we need to do this in several steps. So let's see what
> would be the minimal set of these patches that could be reviewed and
> committed, while still being useful. The main patches are:
>
> 1. shared infrastructure and functional dependencies
> 2. clause reduction using functional dependencies
> 3. multivariate MCV lists
> 4. multivariate histograms
> 5. multi-statistics estimation
>
> Would it make sense to commit only patches 1 and 2 first? Would that be
> enough to get a benefit from this?

I agree that the patch can't be reviewed as a single chunk - that was the idea when I split the original (single chunk) patch into multiple smaller pieces. And yes, I believe committing pieces 1+2 might be enough to get something useful, which can then be improved by adding the usual MCV and histogram stats on top of that.

> I have some doubts about the clause reduction and functional
> dependencies part of this. It seems to treat functional dependency as a
> boolean property, but even with the classic zipcode and city case, it's
> not always an all or nothing thing. At least in some countries, there
> can be zipcodes that span multiple cities. So zipcode=X does not
> completely imply city=Y, although there is a strong correlation (if
> that's the right term). How strong does the correlation need to be for
> this patch to decide that zipcode implies city? I couldn't actually see
> a clear threshold stated anywhere.
>
> So rather than treating functional dependence as a boolean, I think it
> would make more sense to put a 0.0-1.0 number to it. That means that
> you can't do clause reduction like it's done in this patch, where you
> actually remove clauses from the query for cost estimation purposes.
> Instead, you need to calculate the selectivity for each clause
> independently, but instead of just multiplying the selectivities
> together, apply the dependence factor to it. Does that make sense?
>
> I haven't really looked at the MCV, histogram and multi-statistics
> estimation patches yet. Do those patches make the clause reduction
> patch obsolete? Should we forget about the clause reduction and
> functional dependency patch, and focus on those later patches instead?

Perhaps. It's true that most real-world data sets are not 100% valid with respect to functional dependencies - either because of natural imperfections (multiple cities with the same ZIP code) or just noise in the data (incorrect entries ...). And it's even mentioned in the code comments somewhere, I guess.

But there are two main reasons why I chose not to extend the functional dependencies with the [0.0-1.0] value you propose. Firstly, functional dependencies were meant to be the simplest possible implementation, illustrating how the infrastructure is supposed to work (which is the main topic of the first patch). Secondly, all kinds of statistics are simplifications of the actual data, so I think it's not incorrect to ignore the exceptions up to some threshold.

I also don't think this will make the estimates globally better. Let's say you have 1% of rows that contradict the functional dependency - you may either ignore them and have good estimates for 99% of the values and incorrect estimates for 1%, or tweak the rule a bit and make the estimates worse for 99% (and possibly better for 1%).

That being said, I'm not against improving the functional dependencies. I already do have some improvements on my TODO - like for example dependencies on more columns (not just A=B but [A,B]=C and such), but I think we should not squash this into those two patches.
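For illustration, here is one way a 0.0-1.0 dependence degree could enter the selectivity arithmetic instead of hard clause reduction - a sketch of Heikki's proposal, not code from the patch (the formula and the function name are mine):

```python
def combined_selectivity(sel_a, sel_b, degree):
    """Selectivity of (A = x AND B = y) given that A => B holds with
    the given degree in [0.0, 1.0].

    degree = 0.0 falls back to independence (plain multiplication);
    degree = 1.0 means B is fully implied by A, so the second clause
    adds no extra filtering (equivalent to reducing it away).
    """
    return sel_a * (degree + (1.0 - degree) * sel_b)

# Independence multiplies the clauses as usual ...
assert combined_selectivity(0.01, 0.001, 0.0) == 0.01 * 0.001
# ... while a perfect dependency effectively drops the implied clause,
assert combined_selectivity(0.01, 0.001, 1.0) == 0.01
# and intermediate degrees interpolate smoothly - no steep cliff.
```

Note this still assumes the two clauses use compatible values; the 'New York' + Detroit-zip caveat raised later in the thread applies to this formula as well.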
And yet another point - ISTM these cases might easily be handled better by the statistics based on ndistinct coefficients, as proposed by Kyotaro-san some time ago. That is, compute and track

    ndistinct(A) * ndistinct(B) / ndistinct(A,B)

for all pairs of columns (or possibly larger groups). That seems to be similar to the coefficient you propose.

regards

-- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
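To make the ndistinct coefficient concrete, here is a toy computation (illustrative only; the function name is mine, and a real implementation would estimate the ndistinct values from a sample rather than exact counts):

```python
def ndistinct_coefficient(rows):
    """ndistinct(A) * ndistinct(B) / ndistinct(A,B) over (a, b) pairs.

    Close to 1.0 for independent columns; approaches
    min(ndistinct(A), ndistinct(B)) as the dependency gets stronger.
    """
    nd_a = len({a for a, _ in rows})
    nd_b = len({b for _, b in rows})
    nd_ab = len(set(rows))
    return nd_a * nd_b / nd_ab

# zip -> city: each zip maps to exactly one city, so ndistinct(A,B)
# collapses to ndistinct(zip) and the coefficient rises above 1.0.
zips = [("10001", "NYC"), ("10002", "NYC"),
        ("48201", "Detroit"), ("48202", "Detroit")]
assert ndistinct_coefficient(zips) == 2.0

# Fully independent columns: every combination occurs, coefficient 1.0.
cross = [(a, b) for a in (1, 2) for b in (1, 2)]
assert ndistinct_coefficient(cross) == 1.0
```

Unlike the all-or-nothing dependency decision, this value degrades gradually as exceptional rows appear.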
Re: [HACKERS] multivariate statistics / patch v7
Hi,

On 07/30/2015 06:58 PM, Heikki Linnakangas wrote:
> The problem with a threshold is that around that threshold, even a
> small change in the data set can drastically change the produced
> estimates. For example, imagine that we know from the stats that zip
> code implies city. But then someone adds a single row to the table with
> an odd zip code / city combination, which pushes the estimator over the
> threshold, and the columns are no longer considered dependent, and the
> estimates are now completely different. We should avoid steep cliffs
> like that.
>
> BTW, what is the threshold in the current patch?

There's not a simple threshold - the algorithm mining the functional dependencies is a bit more complicated. I tried to explain it in the comment before build_mv_dependencies (in dependencies.c), but let me briefly summarize it here. To mine the dependency [A => B], build_mv_dependencies does this:

(1) sort the sample by {A,B}

(2) split the sample into groups with the same value of A

(3) for each group, decide if it's consistent with the dependency

    (a) if the group is too small (less than 3 rows), ignore it
    (b) if the group is consistent, update n_supporting / n_supporting_rows
    (c) if the group is inconsistent, update n_contradicting / n_contradicting_rows

(4) decide whether the dependency is valid by checking

    n_supporting_rows >= n_contradicting_rows * 10

The limit is rather arbitrary and yes - I can imagine a more complex condition (e.g. looking at the average number of tuples per group etc.), but I haven't looked into that - the point was to use something very simple, only to illustrate the infrastructure.

I think we might come up with some elaborate way of associating a degree with the functional dependency, but at that point we really lose the simplicity, and also make it indistinguishable from the remaining statistics (because it won't be possible to reduce the clauses like this, before performing the regular estimation).
Which is exactly what makes the functional dependencies so neat and efficient, so I'm not overly enthusiastic about doing that. What seems more interesting is implementing the ndistinct coefficient instead, as proposed by Kyotaro-san - that seems to have the nice smooth behavior you desire, while keeping the simplicity.

Both statistics types (functional dependencies and ndistinct coefficients) have one weak point, though - they somehow assume the queries use compatible values. For example if you use a query with

    WHERE city = 'New York' AND zip = 'zip for Detroit'

they can't detect cases like this, because those statistics types are oblivious to individual values. I don't see this as a fatal flaw, though - it's rather a consequence of the nature of the stats. And I tend to look at the functional dependencies the same way. If you need stats without these issues you'll have to use an MCV list or a histogram. Trying to fix the simple statistics types is futile, IMHO.

regards Tomas

-- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
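The mining procedure summarized earlier in the thread (sort by {A,B}, split into per-A groups, count supporting vs. contradicting rows, then apply the 10x rule) can be sketched roughly as follows. This is a simplification: the real build_mv_dependencies also tracks group counts separately from row counts, and operates on an analyze sample, not Python tuples.

```python
from itertools import groupby

def find_dependency(sample, min_group_size=3, ratio=10):
    """Return True if the sample of (a, b) pairs supports A => B."""
    sample = sorted(sample)                      # (1) sort by {A,B}
    n_supporting_rows = n_contradicting_rows = 0
    for _, rows in groupby(sample, key=lambda r: r[0]):  # (2) group by A
        rows = list(rows)
        if len(rows) < min_group_size:           # (3a) too small: ignore
            continue
        if len({b for _, b in rows}) == 1:       # (3b) single B value
            n_supporting_rows += len(rows)
        else:                                    # (3c) mixed B values
            n_contradicting_rows += len(rows)
    # (4) the rather arbitrary 10x acceptance rule
    return n_supporting_rows >= n_contradicting_rows * ratio

# Clean zip -> city data supports the dependency ...
assert find_dependency([("z1", "c1")] * 5 + [("z2", "c2")] * 4)
# ... while a zip spread across three cities defeats it.
assert not find_dependency([("z1", "c1"), ("z1", "c2"), ("z1", "c3"),
                            ("z2", "c1"), ("z2", "c1"), ("z2", "c1")])
```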
Re: [HACKERS] multivariate statistics / patch v7
On 05/25/2015 11:43 PM, Tomas Vondra wrote:
> There are 6 files attached, but only 0002-0006 are actually part of the
> multivariate statistics patch itself.

All of these patches are huge. In order to review this in a reasonable amount of time, we need to do this in several steps. So let's see what would be the minimal set of these patches that could be reviewed and committed, while still being useful. The main patches are:

1. shared infrastructure and functional dependencies
2. clause reduction using functional dependencies
3. multivariate MCV lists
4. multivariate histograms
5. multi-statistics estimation

Would it make sense to commit only patches 1 and 2 first? Would that be enough to get a benefit from this?

I have some doubts about the clause reduction and functional dependencies part of this. It seems to treat functional dependency as a boolean property, but even with the classic zipcode and city case, it's not always an all or nothing thing. At least in some countries, there can be zipcodes that span multiple cities. So zipcode=X does not completely imply city=Y, although there is a strong correlation (if that's the right term). How strong does the correlation need to be for this patch to decide that zipcode implies city? I couldn't actually see a clear threshold stated anywhere.

So rather than treating functional dependence as a boolean, I think it would make more sense to put a 0.0-1.0 number to it. That means that you can't do clause reduction like it's done in this patch, where you actually remove clauses from the query for cost estimation purposes. Instead, you need to calculate the selectivity for each clause independently, but instead of just multiplying the selectivities together, apply the dependence factor to it. Does that make sense?

I haven't really looked at the MCV, histogram and multi-statistics estimation patches yet. Do those patches make the clause reduction patch obsolete? Should we forget about the clause reduction and functional dependency patch, and focus on those later patches instead?

- Heikki
Re: [HACKERS] multivariate statistics / patch v7
On 07/30/2015 03:55 PM, Tomas Vondra wrote:
> On 07/30/2015 10:21 AM, Heikki Linnakangas wrote:
>> I have some doubts about the clause reduction and functional
>> dependencies part of this. It seems to treat functional dependency as
>> a boolean property, but even with the classic zipcode and city case,
>> it's not always an all or nothing thing. At least in some countries,
>> there can be zipcodes that span multiple cities. So zipcode=X does not
>> completely imply city=Y, although there is a strong correlation (if
>> that's the right term). How strong does the correlation need to be for
>> this patch to decide that zipcode implies city? I couldn't actually
>> see a clear threshold stated anywhere.
>>
>> So rather than treating functional dependence as a boolean, I think it
>> would make more sense to put a 0.0-1.0 number to it. That means that
>> you can't do clause reduction like it's done in this patch, where you
>> actually remove clauses from the query for cost estimation purposes.
>> Instead, you need to calculate the selectivity for each clause
>> independently, but instead of just multiplying the selectivities
>> together, apply the dependence factor to it. Does that make sense?
>>
>> I haven't really looked at the MCV, histogram and multi-statistics
>> estimation patches yet. Do those patches make the clause reduction
>> patch obsolete? Should we forget about the clause reduction and
>> functional dependency patch, and focus on those later patches instead?
>
> Perhaps. It's true that most real-world data sets are not 100% valid
> with respect to functional dependencies - either because of natural
> imperfections (multiple cities with the same ZIP code) or just noise in
> the data (incorrect entries ...). And it's even mentioned in the code
> comments somewhere, I guess.
>
> But there are two main reasons why I chose not to extend the functional
> dependencies with the [0.0-1.0] value you propose. Firstly, functional
> dependencies were meant to be the simplest possible implementation,
> illustrating how the infrastructure is supposed to work (which is the
> main topic of the first patch). Secondly, all kinds of statistics are
> simplifications of the actual data. So I think it's not incorrect to
> ignore the exceptions up to some threshold.

The problem with a threshold is that around that threshold, even a small change in the data set can drastically change the produced estimates. For example, imagine that we know from the stats that zip code implies city. But then someone adds a single row to the table with an odd zip code / city combination, which pushes the estimator over the threshold, and the columns are no longer considered dependent, and the estimates are now completely different. We should avoid steep cliffs like that.

BTW, what is the threshold in the current patch?

- Heikki
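The steep-cliff concern can be made concrete with a toy model of the accept/reject rule (numbers and function names are illustrative, not from the patch):

```python
def dependency_holds(n_supporting_rows, n_contradicting_rows):
    # the hard 10x acceptance rule described elsewhere in the thread
    return n_supporting_rows >= n_contradicting_rows * 10

def estimated_selectivity(sel_a, sel_b, holds):
    # clause reduction drops the implied clause entirely when the
    # dependency is accepted; otherwise independence is assumed
    return sel_a if holds else sel_a * sel_b

# Right at the threshold the dependency is still accepted ...
before = estimated_selectivity(0.01, 0.001, dependency_holds(1000, 100))
# ... but one extra odd zip/city row tips it over,
after = estimated_selectivity(0.01, 0.001, dependency_holds(1000, 101))
# and the estimate collapses by three orders of magnitude.
assert before == 0.01 and after == 0.01 * 0.001
```

A continuous degree (or the ndistinct coefficient discussed in the thread) would make this transition gradual instead of a step.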
Re: [HACKERS] multivariate statistics / patch v7
Hello Horiguchi-san,

On 07/27/2015 09:04 AM, Kyotaro HORIGUCHI wrote:
> Hello, At Sat, 25 Jul 2015 23:09:31 +0200, Tomas Vondra
> tomas.von...@2ndquadrant.com wrote in 55b3fb0b.7000...@2ndquadrant.com
>> Hi,
>> On 07/16/2015 01:51 PM, Kyotaro HORIGUCHI wrote:
>>> Hi, I'd like to show you the modified constitution of multivariate
>>> statistics application logic. Please find the attached. They apply
>>> on your v7 patch.
>>
>> Sadly I do have some trouble getting it to apply correctly :-( So for
>> now all my comments are based on just reading the code.
>
> Ah. My modification to rebase to the master for the time should be the
> culprit. Sorry for the dirty patch.
> # I would recreate the patch if you complained before struggling
> # with the thing..
>
> The core of the modification is on clausesel.c. I attached the patched
> clausesel.c.

I don't see any attachment. Perhaps you forgot to actually attach it?

> My concern about the code at the time was the following,
>
> - You embedded the logic of multivariate estimation into
>   clauselist_selectivity. I think estimation using multivariate
>   statistics is quite different from the ordinary estimate based on
>   single column stats, so they are logically separable and we should
>   do so.

I don't see them as very different, actually quite the opposite. The two kinds of statistics are complementary and should naturally coexist. Perhaps the current code is not perfect and a refactoring would make the code more readable, but I don't think its primary aim should be to separate regular and multivariate stats.

> - You are taking a top-down approach which runs tree-walking to check
>   applicability of mv-stats for every step down the clause tree. If the
>   subtree is found to be mv-applicable, it is split in two parts -
>   mv-compatible and non-compatible. These steps require expression tree
>   walking, which looks like using too much CPU.

I'm taking the top-down approach because that's what the regular stats do, and also because that's what allows implementing the features that I think are interesting - the ability to combine multiple stats in an efficient way, pass conditions and such. I think those two features are very useful and allow very interesting things.

The bottom-up approach would work too, probably - I mean, we could start from the leaves of the expression tree, build the largest subtree compatible with multivariate stats and then try to estimate it. I don't see how we could pass conditions though, which works naturally in the top-down approach. Or maybe a combination of both - identify the compatible subtrees first, then perform the top-down phase.

> - You look to be considering the cases when users create many
>   multivariate statistics on attribute sets having duplications. But it
>   looks too much for me. MV-stats are more resource-eating so we can
>   assume the minimum usage of that.

Not really. I don't expect huge numbers of multivariate stats to be built on the tables. But I think restricting the users to a single multivariate statistics per table would be a significant limitation. And once you allow using multiple multivariate statistics for the set of clauses, supporting overlapping stats is not that difficult.

What it however makes possible is combining multiple small stats into a larger one in a very efficient way - it assumes the overlap is sufficient, of course. But if that's true you may build multiple small (and very accurate) stats instead of one huge (or very inaccurate) statistics. This also makes it possible to handle complex combinations of clauses that are compatible and incompatible with multivariate statistics, by passing the conditions.

> My suggestion in the patch is a bottom-up approach to find
> mv-applicable portion(s) in the expression tree, which is the basic way
> of the planner overall. The approach requires no repetitive run of the
> tree walker, that is, pull_varnos. It could fail to find the 'optimal'
> solution for complex situations but needs far less calculation for
> almost the same return (I think..). Even though it doesn't consider the
> functional dependency, the reduction of the code shows the efficiency.
> It does nothing tricky.

OK

>> On a conceptual level, I think the idea to split the estimation into
>> two phases - enrich the expression tree with nodes with details about
>> stats etc., and then actually do the estimation in the second phase -
>> might be interesting. Not because it's somehow clearer, but because it
>> gives us a chance to see the expression tree as a whole, with details
>> about all the stats (with the current code we process/estimate the
>> tree incrementally). But I don't really know how useful that would be.
>
> It is difficult to say which approach is better, since it is affected
> by what we consider important more than other things. However I am
> concerned that your code substantially reconstructs the expression
> (clause) tree in the midst of processing it. I believe it should be a
> separate phase for simplicity. Of course additional required resource
> is also
Re: [HACKERS] multivariate statistics / patch v7
Hello,

At Sat, 25 Jul 2015 23:09:31 +0200, Tomas Vondra tomas.von...@2ndquadrant.com wrote in 55b3fb0b.7000...@2ndquadrant.com
> Hi,
> On 07/16/2015 01:51 PM, Kyotaro HORIGUCHI wrote:
>> Hi, I'd like to show you the modified constitution of multivariate
>> statistics application logic. Please find the attached. They apply on
>> your v7 patch.
>
> Sadly I do have some trouble getting it to apply correctly :-( So for
> now all my comments are based on just reading the code.

Ah. My modification to rebase to the master for the time should be the culprit. Sorry for the dirty patch.

# I would recreate the patch if you complained before struggling
# with the thing..

The core of the modification is on clausesel.c. I attached the patched clausesel.c.

> FWIW I've rebased my patch to the current master, it's available on
> github as usual: https://github.com/tvondra/postgres/commits/mvstats

Thanks.

>> The code to find mv-applicable clauses is moved out of the main flow
>> of clauselist_selectivity. As I said in the previous mail, the new
>> function transformRestrictInfoForEstimate (too bad name but just for
>> PoC:) scans the clauselist and generates a RestrictStatsData struct
>> which drives mv-aware selectivity calculation. This struct isolates MV
>> and non-MV estimation. The struct RestrictStatData mainly consists of
>> the following three parts,
>>
>> - clause to be estimated by current logic (MV is not applicable)
>> - clause to be estimated by MV-statistics
>> - list of child RestrictStatDatas, which are to be run recursively
>>
>> mvclause_selectivity() is the topmost function where mv stats work.
>> This structure effectively prevents the main estimation flow from
>> being broken by modifying the mvstats part. Although I haven't
>> measured it, I'm positive the code is far reduced from yours.
>>
>> I attached two patches to this message. The first one rebases the v7
>> patch to the current (maybe) master and the second applies the
>> refactoring. I'm a little anxious about performance but I think this
>> makes the process to apply mv-stats far clearer. Regtests for mvstats
>> succeed as-is except for fdep, which is not implemented in this patch.
>> What do you think about this?
>
> I'm not sure, at this point. I'm having a hard time understanding how
> exactly the code works - there are pretty much no comments explaining
> the implementation, so it takes time to understand the code. This is
> especially true about transformRestrictInfoForEstimate which is also
> quite long. I understand it's a PoC, but comments would really help.

The patch itself should be hardly readable because it's not from master but from your last patch plus something.

My concern about the code at the time was the following,

- You embedded the logic of multivariate estimation into clauselist_selectivity. I think estimation using multivariate statistics is quite different from the ordinary estimate based on single column stats, so they are logically separable and we should do so.

- You are taking a top-down approach which runs tree-walking to check applicability of mv-stats for every step down the clause tree. If the subtree is found to be mv-applicable, it is split in two parts - mv-compatible and non-compatible. These steps require expression tree walking, which looks like using too much CPU.

- You look to be considering the cases when users create many multivariate statistics on attribute sets having duplications. But it looks too much for me. MV-stats are more resource-eating so we can assume the minimum usage of that.

My suggestion in the patch is a bottom-up approach to find mv-applicable portion(s) in the expression tree, which is the basic way of the planner overall. The approach requires no repetitive run of the tree walker, that is, pull_varnos. It could fail to find the 'optimal' solution for complex situations but needs far less calculation for almost the same return (I think..). Even though it doesn't consider the functional dependency, the reduction of the code shows the efficiency. It does nothing tricky.

> On a conceptual level, I think the idea to split the estimation into
> two phases - enrich the expression tree with nodes with details about
> stats etc., and then actually do the estimation in the second phase -
> might be interesting. Not because it's somehow clearer, but because it
> gives us a chance to see the expression tree as a whole, with details
> about all the stats (with the current code we process/estimate the
> tree incrementally). But I don't really know how useful that would be.

It is difficult to say which approach is better, since it is affected by what we consider important more than other things. However I am concerned that your code substantially reconstructs the expression (clause) tree in the midst of processing it. I believe it should be a separate phase for simplicity. Of course the additionally required resource should also be considered, but it is rather reduced in this case.

> I don't think the proposed change makes the
Re: [HACKERS] multivariate statistics / patch v7
Hi,

On 07/16/2015 01:51 PM, Kyotaro HORIGUCHI wrote:
> Hi, I'd like to show you the modified constitution of multivariate
> statistics application logic. Please find the attached. They apply on
> your v7 patch.

Sadly I do have some trouble getting it to apply correctly :-( So for now all my comments are based on just reading the code.

FWIW I've rebased my patch to the current master, it's available on github as usual: https://github.com/tvondra/postgres/commits/mvstats

> The code to find mv-applicable clauses is moved out of the main flow of
> clauselist_selectivity. As I said in the previous mail, the new
> function transformRestrictInfoForEstimate (too bad name but just for
> PoC:) scans the clauselist and generates a RestrictStatsData struct
> which drives mv-aware selectivity calculation. This struct isolates MV
> and non-MV estimation. The struct RestrictStatData mainly consists of
> the following three parts,
>
> - clause to be estimated by current logic (MV is not applicable)
> - clause to be estimated by MV-statistics
> - list of child RestrictStatDatas, which are to be run recursively
>
> mvclause_selectivity() is the topmost function where mv stats work.
> This structure effectively prevents the main estimation flow from being
> broken by modifying the mvstats part. Although I haven't measured it,
> I'm positive the code is far reduced from yours.
>
> I attached two patches to this message. The first one rebases the v7
> patch to the current (maybe) master and the second applies the
> refactoring. I'm a little anxious about performance but I think this
> makes the process to apply mv-stats far clearer. Regtests for mvstats
> succeed as-is except for fdep, which is not implemented in this patch.
> What do you think about this?

I'm not sure, at this point. I'm having a hard time understanding how exactly the code works - there are pretty much no comments explaining the implementation, so it takes time to understand the code. This is especially true about transformRestrictInfoForEstimate which is also quite long. I understand it's a PoC, but comments would really help.

On a conceptual level, I think the idea to split the estimation into two phases - enrich the expression tree with nodes with details about stats etc., and then actually do the estimation in the second phase - might be interesting. Not because it's somehow clearer, but because it gives us a chance to see the expression tree as a whole, with details about all the stats (with the current code we process/estimate the tree incrementally). But I don't really know how useful that would be.

I don't think the proposed change makes the process somehow clearer. I know it's a PoC at this point, so I don't expect it to be perfect, but for me the original code is certainly clearer. Of course, I'm biased as I wrote the current code, and I (naturally) shaped it to match my ideas during the development process, and I'm much more familiar with it.

Omitting the support for functional dependencies is a bit unfortunate, I think. Is that merely to make the PoC simpler, or is there something that makes it impossible to support that kind of stats?

Another thing that I noticed is that you completely removed the code that combined multiple stats (and selected the best combination of stats). In other words, you've reverted to the intermediate single-statistics approach, including removing the improved handling of OR clauses and conditions. It's a bit difficult to judge the proposed approach not knowing how well it supports those (quite crucial) features. What if it can't support some of them, or what if it makes the code much more complicated (thus defeating the goal of making it more clear)?

I share your concern about the performance impact - one thing is that this new code might be slower than the original one, but a more serious issue IMHO is that the performance impact will happen even for relations with no multivariate stats at all. The original patch was very careful about getting ~0% overhead in such cases, and if the new code does not allow that, I don't see this approach as acceptable. We must not put additional overhead on people not using multivariate stats.

But I think it's worth exploring this idea a bit more - can you rebase it to the current patch version (as on github) and add the missing pieces (functional dependencies, multi-statistics estimation and passing conditions)?

One more thing - I noticed you extended the pg_operator catalog with an oprmvstat attribute, used to flag operators that are compatible with multivariate stats. I'm not happy with the current approach (using oprrest to make this decision), but I'm not really sure this is a good solution either. The culprit is that it only answers one of the two important questions - Is it compatible? How to perform the estimation? So we'd have to rely on oprrest anyway, when actually performing the estimation of a clause with a compatible operator. And we'd have to keep in
Re: [HACKERS] multivariate statistics / patch v7
Hi, On 07/13/2015 10:51 AM, Kyotaro HORIGUCHI wrote: Ok, I understood the diferrence between what I thought and what you say. The code is actually concious of OR clause but is looks somewhat confused. I'm not sure which part is confused by the OR clauses, but it's certainly possible. Initially it only handled AND clauses, and the support for OR clauses was added later, so it's possible some parts are not behaving correctly. Currently choosing mv stats in clauselist_selectivity can be outlined as following, 1. find_stats finds candidate mv stats containing *all* attributes appeared in the whole clauses regardless of and/or exprs by walking whole the clause tree. Perhaps this is the measure to early bailout. Not entirely. The goal of find_stats() is to lookup all stats on the 'current' relation - it's coded the way it is because I had to deal with varRelid=0 cases, in which case I have to inspect the Var nodes. But maybe I got this wrong and there's much simpler way to do that? It is an early bailout in the sense that if there are no multivariate stats defined on the table, there's no point in doing any of the following steps. So that we don't increase planning times for users not using multivariate stats. 2.1. Within every disjunction elements, collect mv-related attributes while checking whether the all leaf nodes (binop or ifnull) are compatible by (eventually) walking whole the clause tree. Generally, yes. The idea is to check whether there are clauses that might be estimated using multivariate statistics, and whether the clauses reference at least two different attributes. Imagine a query like this: SELECT * FROM t WHERE (a=1) AND (a0) AND (a100) It makes no sense to process this using multivariate statistics, because all the Var nodes reference a single attribute. Similarly, the check is not just about the leaf nodes - to be able to estimate a clause at this point, we have to be able to process the whole tree, starting from the top-level clause. 
Although maybe that's no longer true, now that support for OR clauses was added ... I wonder whether there are other BoolExpr-like nodes, that might make the tree incompatible with multivariate statistics (in the sense that the current implementation does not know how to handle them). Also note that even though the clause may be incompatible at this level, it may get partially processed by multivariate statistics later. For example with a query: SELECT * FROM t WHERE (a=1 OR b=2 OR c ~* 'xyz') AND (q=1 OR r=4) the first query is incompatible because it contains unsupported operator '~*', but it will eventually be processed as BoolExpr node, and should be split into two parts - (a=1 OR b=2) which is compatible, and (c ~* 'xyz') which is incompatible. This split should happen in clauselist_selectivity_or(), and the other thing that may be interesting is that it uses (q=1 OR r=4) as a condition. So if there's a statistics built on (a,b,q,r) we'll compute conditional probability P(a=1,b=2 | q=1,r=4) 2.2. Check if all the collected attribute are contained in mv-stats columns. No, I think you got this wrong. We do not check that *all* the attributes are contained in mvstats columns - we only need two such columns (then there's a chance that the multivariate statistics will get applied). Anyway, both 2.1 and 2.2 are meant as a quick bailout, before doing the most expensive part, which is choose_mv_statistics(). Which is however missing in this list. 3. Finally, clauseset_mv_selectivity_histogram() (and others). This funciton applies every ExprOp onto every attribute in every histogram backes and (tries to) make the boolean operation of the result bitmaps. Yes, but this only happens after choose_mv_statistics(), because that's the code that decides which statistics will be used and in what order. 
The list is also missing the handling of 'functional dependencies', so a complete list of steps would look like this:

1) find_stats - look up stats on the current relation (from RelOptInfo)

2) apply functional dependencies
   a) check if there are equality clauses that may be reduced using functional dependencies, referencing at least two columns
   b) if yes, perform the clause reduction

3) apply MCV lists and histograms
   a) check if there are clauses 'compatible' with those types of statistics, again containing at least two columns
   b) if yes, use choose_mv_statistics() to decide which statistics to apply and in which order
   c) apply the selected histograms and MCV lists

4) estimate the remaining clauses using the regular statistics
   a) this is where clauselist_mv_selectivity_histogram and others are called

I tried to explain this in the comment before clauselist_selectivity(), but maybe it's not detailed enough / missing some important details.
Re: [HACKERS] multivariate statistics / patch v7
Hi, Thanks for the detailed explanation. I misunderstood the code (more honestly speaking, I didn't look so closely there). Then I looked at it closer. At Wed, 08 Jul 2015 03:03:16 +0200, Tomas Vondra tomas.von...@2ndquadrant.com wrote in 559c76d4.2030...@2ndquadrant.com FWIW this was a stupid bug in update_match_bitmap_histogram(), which initially handled only AND clauses, and thus assumed the match of a bucket can only decrease. But for OR clauses this is exactly the opposite (we assume no buckets match and add buckets matching at least one of the clauses). With this fixed, the estimates look like this: IMHO pretty accurate estimates - no issue with OR clauses. Ok, I understood the difference between what I thought and what you say. The code is actually conscious of OR clauses, but it looks somewhat confused. Currently, choosing mv stats in clauselist_selectivity can be outlined as follows:

1. find_stats finds candidate mv stats containing *all* attributes that appear in the whole clause list, regardless of AND/OR exprs, by walking the whole clause tree. Perhaps this is a measure for early bailout.

2.1. Within every disjunction element, collect mv-related attributes while checking whether all the leaf nodes (binop or ifnull) are compatible, by (eventually) walking the whole clause tree.

2.2. Check if all the collected attributes are contained in mv-stats columns.

3. Finally, clauseset_mv_selectivity_histogram() (and others). This function applies every ExprOp to every attribute in every histogram bucket and (tries to) perform the boolean operation on the resulting bitmaps.

I have some comments on the implementation, and I will also try to find solutions for them.

1. The flow above looks like it does very similar things repeatedly.
2. I believe what the current code does can be simplified.
3. As you mentioned in comments, some additional infrastructure is needed.

After all, I think what we should do after this is as follows, as the first step.

- Add a means to judge the applicability of an operator for selectivity estimation by something other than the oprrest of the op of the ExprOp. (You already missed neqsel.) I suppose one solution for this is adding an oprmvstats column taking 'm', 'h' and 'f' and their combinations. Or, for convenience, it would be a fixed-length string like this:

  oprname | oprmvstats
  --------+-----------
   =      | 'mhf'
   <>     | 'mhf'
   <      | 'mh-'
   >      | 'mh-'
   <=     | 'mh-'
   >=     | 'mh-'

This would make the code in clause_is_mv_compatible look like this:

  oprmvstats = get_mvstatsset(expr->opno); /* bitwise representation */
  if (oprmvstats & types)
  {
      *attnums = bms_add_member(*attnums, var->varattno);
      return true;
  }
  return false;

- The current design just manages to work, but it is too complicated and hardly has affinity with the existing estimation framework. I proposed separating the stats-finding phase and the calculation phase, but after looking closer at the patch I would like to propose a phase that transforms the RestrictInfo (finding mvstats on the way) and a phase that runs on the transformed RestrictInfo. I think transforming the RestrictInfo makes the situation better. Since it needs different information, maybe it is better to have a new struct, say, RestrictInfoForEstimate (boo!). Then provide mvstatssel() for use in the new struct. A rough sketch of the code would look like below:

  clauselist_selectivity()
  {
      ...
      RestrictInfoForEstimate *esclause =
          transformClauseListForEstimation(root, clauses, varRelid);
      ...
      return clause_selectivity(esclause);
  }

  clause_selectivity(RestrictInfoForEstimate *esclause)
  {
      if (IsA(clause, RestrictInfo)) ...
      if (IsA(clause, RestrictInfoForEstimate))
      {
          RestrictInfoForEstimate *ecl = (RestrictInfoForEstimate *) clause;
          if (ecl->selfunc)
          {
              sx = ecl->selfunc(root, ecl);
          }
      }
      if (IsA(clause, Var)) ...
  }

  transformClauseListForEstimation(...)
  {
      ...
      relid = collect_mvstats_info(root, clause, attlist);
      if (!relid)
          return;
      if (get_mvstats_hook)
          mvstats = (*get_mvstats_hook)(root, relid, attset);
      else
          mvstats = find_mv_stats(root, relid, attset);
  }
  ...

I've pushed this to github [1] but I need to do some additional fixes.
I also had to remove some optimizations while fixing this, and will have to reimplement those. That's not to say that the handling of OR-clauses is perfectly correct. After looking at clauselist_selectivity_or(), I believe it's a bit broken and will need a bunch of fixes, as explained in the FIXMEs I pushed to github. [1] https://github.com/tvondra/postgres/tree/mvstats I don't see whether it is doable or not, and I suppose you're unwilling to change the big picture, so I will consider the idea and show you the result if it turns out to be possible and promising. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
Re: [HACKERS] multivariate statistics / patch v7
Hi, Tomas. I'll kick the gas pedal. Thank you, it looks clearer. I have some comments from a brief look at this. This patchset is relatively large, so I will comment on a per-notice basis, which means I'll send comments before examining the entirety of this patchset. Sorry in advance for the desultory comments. Sure. If you run into something that's not clear enough, I'm happy to explain that (I tried to cover all the important details in the comments, but it's a large patch, indeed.) - Single-variate stats have a mechanism to inject arbitrary values as statistics, that is, get_relation_stats_hook and similar stuff. I want a similar mechanism for multivariate statistics, too. Fair point, although I'm not sure where we should place the hook, how exactly it should be defined, and how useful it would be in the end. Can you give an example of how you'd use such a hook? It's my secret, but it is open :p. This is crucial for us to examine many planner-related problems occurring at our customers, in vitro. http://pgdbmsstats.osdn.jp/pg_dbms_stats-en.html # Mmm, this doc is a bit too old.. One of our tools does something like the following: - Copy pg_statistics and some attributes of pg_class into some table. Of course this is exportable. - For example, in examine_simple_variable, using the hook get_relation_stats_hook, inject the saved statistics in place of the real statistics. The hook point is placed where the parameters specifying which statistics are needed are available in a compact shape, and all the hook function should do is return the corresponding statistics values. So the parallel stuff for these mv stats would look like this:

  MVStatisticInfo *get_mv_statistics(PlannerInfo *root, relid);

or

  MVStatisticInfo *get_mv_statistics(PlannerInfo *root, relid, bitmap or list of attnos);

So by simply applying this, the current clauselist_selectivity code would turn into the following.
  if (list_length(clauses) == 1)
      return clause_selectivity();

  Index varrelid = find_singleton_relid(root, clauses, varRelid);

  if (varrelid)
  {
      // Bitmapset
      attnums = collect_attnums(root, clauses, varrelid);
      if (get_mv_statistics_hook)
          stats = get_mv_statistics_hook(root, varrelid /*, attnums */);
      else
          stats = get_mv_statistics(root, varrelid /*, attnums */);

In comparison to single-column statistics, it might be preferable to keep the statistics values separate from the definition. I've never used get_relation_stats_hook, but if I get it right, the plugins can use the hook to create the stats (for each column), either from scratch or by tweaking the existing stats. Mostly existing stats without change. I saw a few hackers who wanted to provide predefined statistics for typical cases. I haven't seen anyone who tweaks existing stats. I'm not sure how this should work with multivariate stats, though, because there can be an arbitrary number of stats for a column, and it really depends on all the clauses (so examine_variable() seems a bit inappropriate, as it only sees a single variable at a time). Restriction clauses are not a problem. What is needed to replace stats values is defining a few APIs to retrieve them, and retrieving the stats values only in a way compatible with the API. As an extreme case it would be okay to have substitute views for mv stats, but that is not good. Moreover, with multivariate stats (a) there may be an arbitrary number of stats for a column (b) only some of the stats end up being used for the estimation I see two or three possible places for calling such a hook: (a) at the very beginning, after fetching the list of stats - sees all the existing stats on a table - may add entirely new stats or tweak the existing ones Getting all stats for a table would be okay, but an attnum list can restrict the possibilities, as in the second form of the example APIs above. And we may forget the case of forged or tweaked stats; they are their problem, not ours.
(b) after collecting the list of variables compatible with multivariate stats - like (a), and additionally knows which columns are interesting for the query (but only with respect to the existing stats) We should carefully design the API to be able to point at the pertinent stats in every situation. Mv stats are based on the correlation of multiple columns, so I think only a relid and an attribute list are enough as the parameters:

| if (st.relid == param.relid && bms_equal(st.attnums, param.attnums))
|     /* These are the stats we want */

If we can filter the appropriate stats from all the stats using the clause list, we definitely can build the appropriate parameter (column set) prior to retrieving mv statistics. Isn't that correct? (c) after optimization (selection of the right combination of stats) - like (b), but can't affect the optimization But I can't really imagine anyone building multivariate stats on the fly, in the hook. It's more
Re: [HACKERS] multivariate statistics / patch v7
Hi, On 07/07/2015 08:05 AM, Kyotaro HORIGUCHI wrote: Hi, Tomas. I'll kick the gas pedal. Thank you, it looks clearer. I have some comments from a brief look at this. This patchset is relatively large, so I will comment on a per-notice basis, which means I'll send comments before examining the entirety of this patchset. Sorry in advance for the desultory comments. Sure. If you run into something that's not clear enough, I'm happy to explain that (I tried to cover all the important details in the comments, but it's a large patch, indeed.) - Single-variate stats have a mechanism to inject arbitrary values as statistics, that is, get_relation_stats_hook and similar stuff. I want a similar mechanism for multivariate statistics, too. Fair point, although I'm not sure where we should place the hook, how exactly it should be defined, and how useful it would be in the end. Can you give an example of how you'd use such a hook? ... We should carefully design the API to be able to point at the pertinent stats in every situation. Mv stats are based on the correlation of multiple columns, so I think only a relid and an attribute list are enough as the parameters:

| if (st.relid == param.relid && bms_equal(st.attnums, param.attnums))
|     /* These are the stats we want */

If we can filter the appropriate stats from all the stats using the clause list, we definitely can build the appropriate parameter (column set) prior to retrieving mv statistics. Isn't that correct? Let me briefly explain how the current clauselist_selectivity implementation works.

(1) check if there are multivariate statistics on the table - if not, skip the multivariate parts altogether (the point of this is to minimize impact on users who don't use the new feature)

(2) see if there are clauses compatible with multivariate stats - this only checks general compatibility without actually checking the existing stats (the point is to terminate early if the clauses are not compatible somehow - e.g.
if the clauses reference only a single attribute, use unsupported operators etc.)

(3) if there are multivariate stats and compatible clauses, the function choose_mv_stats tries to find the best combination of multivariate stats with respect to the clauses (details later)

(4) the clauses are estimated using the stats, the remaining clauses are estimated using the current statistics (single attribute)

The only way to reliably inject new stats is by calling a hook before (1), allowing it to arbitrarily modify the list of stats. Based on the use cases you provided, I don't think it makes much sense to add additional hooks in the other phases. At this place it is, however, not yet known which clauses are compatible with multivariate stats, or which attributes they are referencing. It might be possible to simply call pull_varattnos() and pass the result to the hook, except that does not work with RestrictInfo :-/ Or maybe we could / should put the hook not into clauselist_selectivity but somewhere else? Say, into get_relation_info, where we actually read the list of stats for the relation? 0001: - I also don't think it is the right thing for expression_tree_walker to recognize RestrictInfo, since it is not a part of an expression. Yes. In my working git repo, I've reworked this to use the second option, i.e. adding RestrictInfo to pull_(varno|varattno)_walker: https://github.com/tvondra/postgres/commit/2dc79b914c759d31becd8ae670b37b79663a595f Do you think this is the correct solution? If not, how to fix it? The reason why I think it is not appropriate is that RestrictInfo is not a part of an expression. An increase in the selectivity of a condition due to column correlation occurs only for a set of conjunctive clauses. An OR operation divides the sets. Is that agreeable? RestrictInfos can be nested in each other, and we should be aware of the AND/OR operators. This is what expression_tree_walker doesn't do. I still don't understand why you think we need to differentiate between AND and OR operators.
There's nothing wrong with estimating OR clauses using multivariate statistics. Perhaps we should provide a dedicated function, something like find_conjunctive_attr_set, which does this: Perhaps. The reason why I added support for RestrictInfo into the existing walker implementations is that it seemed like the easiest way to fix the issue. But if there are reasons why that's incorrect, then inventing a new function is probably the right way.

- Check the type of the top expression of the clause.
- If it is a RestrictInfo, check clause_relids, then check the clause.
- If it is a bool OR, stop searching and return an empty set of attributes.
- If it is a bool AND, check the components further. A list of RestrictInfos should be treated as an AND connection.
- If it is an operator expression, collect the used relids and attrs by walking the expression tree.

I might be missing something
Re: [HACKERS] multivariate statistics / patch v7
Hello Horiguchi-san! On 07/07/2015 09:43 PM, Tomas Vondra wrote: -- histograms

ALTER TABLE t ADD STATISTICS (histogram) on (a,b,c);
ANALYZE t;

EXPLAIN ANALYZE select * from t where a < 0.3 and b < 0.3 and c < 0.3;
 Seq Scan on t (cost=0.00..23870.00 rows=267033 width=24) (actual time=0.021..330.487 rows=273897 loops=1)

EXPLAIN ANALYZE select * from t where a < 0.3 and b < 0.3 or c > 0.3;
 Seq Scan on t (cost=0.00..23870.00 rows=14317 width=24) (actual time=0.027..906.321 rows=966870 loops=1)

EXPLAIN ANALYZE select * from t where a < 0.3 and b < 0.3 or c > 0.9;
 Seq Scan on t (cost=0.00..23870.00 rows=20367 width=24) (actual time=0.028..452.494 rows=393528 loops=1)

This seems wrong, because the estimate for the OR queries should not be lower than the estimate for the first query (with just AND), and it should not increase when increasing the boundary. I'd bet this is a bug in how the inequalities are handled with histograms, or how the AND/OR clauses are combined. I'll look into that. FWIW this was a stupid bug in update_match_bitmap_histogram(), which initially handled only AND clauses, and thus assumed the match of a bucket can only decrease. But for OR clauses this is exactly the opposite (we assume no buckets match and add buckets matching at least one of the clauses).
With this fixed, the estimates look like this:

EXPLAIN ANALYZE select * from t where a < 0.3 and b < 0.3 and c < 0.3;
 Seq Scan on t (cost=0.00..23870.00 rows=267033 width=24) (actual time=0.102..321.524 rows=273897 loops=1)

EXPLAIN ANALYZE select * from t where a < 0.3 and b < 0.3 or c < 0.3;
 Seq Scan on t (cost=0.00..23870.00 rows=319400 width=24) (actual time=0.103..386.089 rows=317533 loops=1)

EXPLAIN ANALYZE select * from t where a < 0.3 and b < 0.3 or c > 0.3;
 Seq Scan on t (cost=0.00..23870.00 rows=956833 width=24) (actual time=0.133..908.455 rows=966870 loops=1)

EXPLAIN ANALYZE select * from t where a < 0.3 and b < 0.3 or c > 0.9;
 Seq Scan on t (cost=0.00..23870.00 rows=393633 width=24) (actual time=0.105..440.607 rows=393528 loops=1)

IMHO pretty accurate estimates - no issue with OR clauses. I've pushed this to github [1] but I need to do some additional fixes. I also had to remove some optimizations while fixing this, and will have to reimplement those. That's not to say that the handling of OR-clauses is perfectly correct. After looking at clauselist_selectivity_or(), I believe it's a bit broken and will need a bunch of fixes, as explained in the FIXMEs I pushed to github. [1] https://github.com/tvondra/postgres/tree/mvstats kind regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] multivariate statistics / patch v7
Hello Horiguchi-san! On 07/03/2015 07:30 AM, Kyotaro HORIGUCHI wrote: Hello, I started to work on this patch. attached is v7 of the multivariate stats patch. The main improvement is major refactoring of the clausesel.c portion - splitting the awfully long spaghetti-style functions into smaller pieces, making it much more understandable etc. Thank you, it looks clearer. I have some comments from a brief look at this. This patchset is relatively large, so I will comment on a per-notice basis, which means I'll send comments before examining the entirety of this patchset. Sorry in advance for the desultory comments. Sure. If you run into something that's not clear enough, I'm happy to explain that (I tried to cover all the important details in the comments, but it's a large patch, indeed.) === General comments: - You included unnecessary stuff such as regression.diffs in these patches. Ah :-/ Will fix. - Now OID 3307 is used by pg_stat_file. I moved pg_mv_stats_dependencies_info/show to 3311/3312. Will fix while rebasing to current master. - Single-variate stats have a mechanism to inject arbitrary values as statistics, that is, get_relation_stats_hook and similar stuff. I want a similar mechanism for multivariate statistics, too. Fair point, although I'm not sure where we should place the hook, how exactly it should be defined, and how useful it would be in the end. Can you give an example of how you'd use such a hook? I've never used get_relation_stats_hook, but if I get it right, the plugins can use the hook to create the stats (for each column), either from scratch or by tweaking the existing stats. I'm not sure how this should work with multivariate stats, though, because there can be an arbitrary number of stats for a column, and it really depends on all the clauses (so examine_variable() seems a bit inappropriate, as it only sees a single variable at a time).
Moreover, with multivariate stats (a) there may be an arbitrary number of stats for a column (b) only some of the stats end up being used for the estimation I see two or three possible places for calling such a hook: (a) at the very beginning, after fetching the list of stats - sees all the existing stats on a table - may add entirely new stats or tweak the existing ones (b) after collecting the list of variables compatible with multivariate stats - like (a), and additionally knows which columns are interesting for the query (but only with respect to the existing stats) (c) after optimization (selection of the right combination of stats) - like (b), but can't affect the optimization But I can't really imagine anyone building multivariate stats on the fly, in the hook. It's more complicated, though, because the query may call clauselist_selectivity multiple times, depending on how complex the WHERE clauses are. 0001: - I also don't think it is the right thing for expression_tree_walker to recognize RestrictInfo, since it is not a part of an expression. Yes. In my working git repo, I've reworked this to use the second option, i.e. adding RestrictInfo to pull_(varno|varattno)_walker: https://github.com/tvondra/postgres/commit/2dc79b914c759d31becd8ae670b37b79663a595f Do you think this is the correct solution? If not, how to fix it? 0003: - In clauselist_selectivity, find_stats is uselessly called for a single clause. It should be called only after the clause list is found to consist of more than one clause. Ok, will fix. - The search for vars to be compared with mv-stats columns, which find_stats does, should stop at disjunctions. But this patch doesn't behave so, and that should be an unwanted behavior. The following steps show that. Why should it stop at disjunctions? There's nothing wrong with using multivariate stats to estimate OR-clauses, IMHO.
=# CREATE TABLE t1 (a int, b int, c int);
=# INSERT INTO t1 (SELECT a, a * 2, a * 3 FROM generate_series(0, ) a);
=# EXPLAIN SELECT * FROM t1 WHERE a = 1 AND b = 2 OR c = 3;
   Seq Scan on t1 (cost=0.00..230.00 rows=1 width=12)
=# ALTER TABLE t1 ADD STATISTICS (HISTOGRAM) ON (a, b, c);
=# ANALYZE t1;
=# EXPLAIN SELECT * FROM t1 WHERE a = 1 AND b = 2 OR c = 3;
   Seq Scan on t1 (cost=0.00..230.00 rows=268 width=12)

Rows changed unwantedly. That has nothing to do with OR clauses, but rather with using a type of statistics that does not fit the data and queries. Histograms are quite inaccurate for discrete data and equality conditions - in this case the clauses probably match one bucket, and so we use 1/2 the bucket as an estimate. There's nothing wrong with that. So let's use MCV instead:

ALTER TABLE t1 ADD STATISTICS (MCV) ON (a, b, c);
ANALYZE t1;
EXPLAIN SELECT * FROM t1 WHERE a = 1 AND b = 2 OR c = 3;
 QUERY PLAN
------------
 Seq Scan on t1 (cost=0.00..230.00 rows=1
Re: [HACKERS] multivariate statistics / patch v7
Hello, I started to work on this patch. attached is v7 of the multivariate stats patch. The main improvement is major refactoring of the clausesel.c portion - splitting the awfully long spaghetti-style functions into smaller pieces, making it much more understandable etc. Thank you, it looks clearer. I have some comments from a brief look at this. This patchset is relatively large, so I will comment on a per-notice basis, which means I'll send comments before examining the entirety of this patchset. Sorry in advance for the desultory comments. === General comments: - You included unnecessary stuff such as regression.diffs in these patches. - Now OID 3307 is used by pg_stat_file. I moved pg_mv_stats_dependencies_info/show to 3311/3312. - Single-variate stats have a mechanism to inject arbitrary values as statistics, that is, get_relation_stats_hook and similar stuff. I want a similar mechanism for multivariate statistics, too. 0001: - I also don't think it is the right thing for expression_tree_walker to recognize RestrictInfo, since it is not a part of an expression. 0003: - In clauselist_selectivity, find_stats is uselessly called for a single clause. It should be called only after the clause list is found to consist of more than one clause. - The search for vars to be compared with mv-stats columns, which find_stats does, should stop at disjunctions. But this patch doesn't behave so, and that should be an unwanted behavior. The following steps show that.

=# CREATE TABLE t1 (a int, b int, c int);
=# INSERT INTO t1 (SELECT a, a * 2, a * 3 FROM generate_series(0, ) a);
=# EXPLAIN SELECT * FROM t1 WHERE a = 1 AND b = 2 OR c = 3;
   Seq Scan on t1 (cost=0.00..230.00 rows=1 width=12)
=# ALTER TABLE t1 ADD STATISTICS (HISTOGRAM) ON (a, b, c);
=# ANALYZE t1;
=# EXPLAIN SELECT * FROM t1 WHERE a = 1 AND b = 2 OR c = 3;
   Seq Scan on t1 (cost=0.00..230.00 rows=268 width=12)

Rows changed unwantedly. It does not seem to be as simple a thing as your code assumes.
I do assume some of those pieces are unnecessary because there already is a helper function with the same purpose (but I'm not aware of it). But IMHO this piece of code begins to look reasonable (especially when compared to the previous state). Yeah, that kind of work should be done later :p This patch is not so invasive as to make that undoable. The other major improvement is a review of the comments (including FIXMEs and TODOs), and removal of the obsolete / misplaced ones. And there were plenty of those ... These changes made this version ~20k smaller than v6. The patch also rebases to current master, which I assume shall be quite stable - so hopefully no more duplicate OIDs for a while. There are 6 files attached, but only 0002-0006 are actually part of the multivariate statistics patch itself. The first part makes it possible to use pull_varnos() with expression trees containing RestrictInfo nodes, but maybe this is not the right way to fix this (there's another thread where this was discussed). As mentioned above, checking whether mv stats can be applied would be a more complex matter than you are now assuming. I will also consider that. Also, the regression tests testing plan choice with multivariate stats (e.g. that a bitmap index scan is chosen instead of an index scan) fail from time to time. I suppose this happens because the invalidation after ANALYZE is not processed before executing the query, so the optimizer does not see the stats, or something like that. I saw that occur, but have no idea how it occurs so far.. regards, -- Kyotaro Horiguchi NTT Open Source Software Center