Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Tue, Mar 10, 2015 at 4:03 PM, Peter Geoghegan p...@heroku.com wrote: On Tue, Mar 10, 2015 at 11:28 AM, Robert Haas robertmh...@gmail.com wrote: I'm prepared to commit this version if nobody finds a problem with it. It passes the additional regression tests you wrote. Looks good to me. Thanks. OK, committed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Mon, Dec 22, 2014 at 7:34 PM, Peter Geoghegan p...@heroku.com wrote: On Mon, Dec 22, 2014 at 5:50 AM, Robert Haas robertmh...@gmail.com wrote: Looking over the latest patch, I think we could simplify the code so that you don't need multiple FuzzyAttrMatchState objects. Instead of creating a separate one for each RTE and then merging them, just have one. When you find an inexact-RTE name match, set a field inside the FuzzyAttrMatchState -- maybe with a name like rte_penalty -- to the Levenshtein distance between the RTEs. Then call scanRTEForColumn() and pass down the same state object. Now let updateFuzzyAttrMatchState() work out what it needs to do based on the observed inter-column distance and the currently-in-force RTE penalty. I'm afraid I don't follow. I think doing things that way makes things less clear. Merging is useful because it allows us to consider that an exact match might exist, which this searchRangeTableForCol() is already tasked with today. We now look for the best match exhaustively, or magically return immediately in the event of an exact match, without caring about the alias correctness or distance. Having a separate object makes this pattern apparent from the top level, within searchRangeTableForCol(). I feel that's better. updateFuzzyAttrMatchState() is the wrong place to put that, because that task rightfully belongs in searchRangeTableForCol(), where the high level diagnostic-report-generating control flow lives. To put it another way, creating a separate object obfuscates scanRTEForColumn(), since it's the only client of updateFuzzyAttrMatchState(). scanRTEForColumn() is a very important function, and right now I am only making it slightly less clear by tasking it with caring about distance of names on top of strict binary equality of attribute names. I don't want to push it any further. I don't buy this. What you're essentially doing is using the FuzzyAttrMatchState object in two ways that are not entirely compatible with each other - updateFuzzyAttrMatchState doesn't set the RTE fields, so searchRangeTableForCol has to do it. So there's an unspoken contract that in some parts of the code, you can rely on those fields being set, and in others, you can't. That pretty much defeats the whole point of making the state its own object, AFAICS. Furthermore, you end up with two copies of the state-combining logic, one in FuzzyAttrMatchState and a second in searchRangeTableForCol. That's ugly and unnecessary. I decided to rework this patch myself today; my version is attached. I believe that this version is significantly easier to understand than yours, both as to the code and the comments. I put quite a bit of work into both. I also suspect it's more efficient, because it avoids computing the Levenshtein distances for column names when we already know that those column names can't possibly be sufficiently-good matches for us to care about the details; and when it does compute the Levenshtein distance it keeps the max-distance threshold as low as possible. That may not really matter much, but it can't hurt. More importantly, essentially all of the fuzzy-matching logic is now isolated in FuzzyAttrMatchState(); the volume of change in scanRTEForColumn is the same as in your version, but the volume of change in searchRangeTableForCol is quite a bit less, so the patch is smaller overall. I'm prepared to commit this version if nobody finds a problem with it. It passes the additional regression tests you wrote. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company column-hint-rmh.patch Description: binary/octet-stream -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Tue, Mar 10, 2015 at 11:28 AM, Robert Haas robertmh...@gmail.com wrote: I'm prepared to commit this version if nobody finds a problem with it. It passes the additional regression tests you wrote. Looks good to me. Thanks. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Tue, Dec 23, 2014 at 9:43 AM, Peter Geoghegan p...@heroku.com wrote: On Mon, Dec 22, 2014 at 4:34 PM, Peter Geoghegan p...@heroku.com wrote: To put it another way, creating a separate object obfuscates scanRTEForColumn(), since it's the only client of updateFuzzyAttrMatchState(). Excuse me. I mean *not* creating a separate object -- having a unified state representation for the entire range-table, rather than having one per RTE and merging them one by one into an overall/final range table object. Patch moved to CF 2015-02. -- Michael
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Sat, Dec 20, 2014 at 7:30 PM, Peter Geoghegan p...@heroku.com wrote: On Sat, Dec 20, 2014 at 3:17 PM, Peter Geoghegan p...@heroku.com wrote: Attached patch implements this scheme. I had another thought: NAMEDATALEN + 1 is a better representation of infinity for matching purposes than INT_MAX. I probably should have made that change, too. It would then not have been necessary to #include limits.h. I think that this is a useful belt-and-suspenders precaution against integer overflow. It almost certainly won't matter, since it's very unlikely that the best match within an RTE will end up being a dropped column, but we might as well do it that way (Levenshtein distance is costed in multiples of code point changes, but the maximum density is 1 byte per codepoint). Good idea. Looking over the latest patch, I think we could simplify the code so that you don't need multiple FuzzyAttrMatchState objects. Instead of creating a separate one for each RTE and then merging them, just have one. When you find an inexact-RTE name match, set a field inside the FuzzyAttrMatchState -- maybe with a name like rte_penalty -- to the Levenshtein distance between the RTEs. Then call scanRTEForColumn() and pass down the same state object. Now let updateFuzzyAttrMatchState() work out what it needs to do based on the observed inter-column distance and the currently-in-force RTE penalty. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Mon, Dec 22, 2014 at 5:50 AM, Robert Haas robertmh...@gmail.com wrote: Looking over the latest patch, I think we could simplify the code so that you don't need multiple FuzzyAttrMatchState objects. Instead of creating a separate one for each RTE and then merging them, just have one. When you find an inexact-RTE name match, set a field inside the FuzzyAttrMatchState -- maybe with a name like rte_penalty -- to the Levenshtein distance between the RTEs. Then call scanRTEForColumn() and pass down the same state object. Now let updateFuzzyAttrMatchState() work out what it needs to do based on the observed inter-column distance and the currently-in-force RTE penalty. I'm afraid I don't follow. I think doing things that way makes things less clear. Merging is useful because it allows us to consider that an exact match might exist, which this searchRangeTableForCol() is already tasked with today. We now look for the best match exhaustively, or magically return immediately in the event of an exact match, without caring about the alias correctness or distance. Having a separate object makes this pattern apparent from the top level, within searchRangeTableForCol(). I feel that's better. updateFuzzyAttrMatchState() is the wrong place to put that, because that task rightfully belongs in searchRangeTableForCol(), where the high level diagnostic-report-generating control flow lives. To put it another way, creating a separate object obfuscates scanRTEForColumn(), since it's the only client of updateFuzzyAttrMatchState(). scanRTEForColumn() is a very important function, and right now I am only making it slightly less clear by tasking it with caring about distance of names on top of strict binary equality of attribute names. I don't want to push it any further. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Mon, Dec 22, 2014 at 4:34 PM, Peter Geoghegan p...@heroku.com wrote: To put it another way, creating a separate object obfuscates scanRTEForColumn(), since it's the only client of updateFuzzyAttrMatchState(). Excuse me. I mean *not* creating a separate object -- having a unified state representation for the entire range-table, rather than having one per RTE and merging them one by one into an overall/final range table object. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Tue, Dec 16, 2014 at 12:18 PM, Stephen Frost sfr...@snowman.net wrote: * Tom Lane (t...@sss.pgh.pa.us) wrote: So my two cents is that when considering a qualified name, this patch should take levenshtein distance across the two components equally. There's no good reason to suppose that typos will attack one name component more (nor less) than the other. Agreed (since it seems like folks are curious for the opinion's of mostly bystanders). +1 to the above for my part. Okay, then. Attached patch implements this scheme. It is identical to the previous revision, except that iff there was an alias specified and that alias does not match the correct name (alias/table name) of the RTE currently under consideration, we charge the distance between the differing aliases rather than a fixed distance of 1. -- Peter Geoghegan From 91087191ba49e5fed7fdfa43f98deb009c2b3e0e Mon Sep 17 00:00:00 2001 From: Peter Geoghegan p...@heroku.com Date: Wed, 12 Nov 2014 15:31:37 -0800 Subject: [PATCH] Levenshtein distance column HINT Add a new HINT -- a guess as to what column the user might have intended to reference, to be shown in various contexts where an ERRCODE_UNDEFINED_COLUMN error is raised. The user will see this HINT when he or she fat-fingers a column reference in an ad-hoc SQL query, or incorrectly pluralizes or fails to pluralize a column reference, or incorrectly omits or includes an underscore or other punctuation character. The HINT suggests a column in the range table with the lowest Levenshtein distance, or the tied-for-best pair of matching columns in the event of there being exactly two equally likely candidates (these may come from multiple RTEs, or the same RTE). Limiting to two the number of cases where multiple equally likely suggestions are all offered at once (i.e. giving no hint when the number of equally likely candidates exceeds two) is a measure against suggestions that are of low quality in an absolute sense. A further, final measure is taken against suggestions that are of low absolute quality: If the distance exceeds a normalized distance threshold, no suggestion is given. --- src/backend/parser/parse_expr.c | 9 +- src/backend/parser/parse_func.c | 2 +- src/backend/parser/parse_relation.c | 325 +++--- src/backend/utils/adt/levenshtein.c | 9 + src/include/parser/parse_relation.h | 20 +- src/test/regress/expected/alter_table.out | 3 + src/test/regress/expected/join.out| 38 src/test/regress/sql/join.sql | 24 +++ 8 files changed, 392 insertions(+), 38 deletions(-) diff --git a/src/backend/parser/parse_expr.c b/src/backend/parser/parse_expr.c index 4a8aaf6..a77a3a0 100644 --- a/src/backend/parser/parse_expr.c +++ b/src/backend/parser/parse_expr.c @@ -621,7 +621,8 @@ transformColumnRef(ParseState *pstate, ColumnRef *cref) colname = strVal(field2); /* Try to identify as a column of the RTE */ -node = scanRTEForColumn(pstate, rte, colname, cref-location); +node = scanRTEForColumn(pstate, rte, colname, cref-location, + NULL); if (node == NULL) { /* Try it as a function call on the whole row */ @@ -666,7 +667,8 @@ transformColumnRef(ParseState *pstate, ColumnRef *cref) colname = strVal(field3); /* Try to identify as a column of the RTE */ -node = scanRTEForColumn(pstate, rte, colname, cref-location); +node = scanRTEForColumn(pstate, rte, colname, cref-location, + NULL); if (node == NULL) { /* Try it as a function call on the whole row */ @@ -724,7 +726,8 @@ transformColumnRef(ParseState *pstate, ColumnRef *cref) colname = strVal(field4); /* Try to identify as a column of the RTE */ -node = scanRTEForColumn(pstate, rte, colname, cref-location); +node = scanRTEForColumn(pstate, rte, colname, cref-location, + NULL); if (node == NULL) { /* Try it as a function call on the whole row */ diff --git a/src/backend/parser/parse_func.c b/src/backend/parser/parse_func.c index 9ebd3fd..472e15e 100644 --- a/src/backend/parser/parse_func.c +++ b/src/backend/parser/parse_func.c @@ -1779,7 +1779,7 @@ ParseComplexProjection(ParseState *pstate, char *funcname, Node *first_arg, ((Var *) first_arg)-varno, ((Var *) first_arg)-varlevelsup); /* Return a Var if funcname matches a column, else NULL */ - return scanRTEForColumn(pstate, rte, funcname, location); + return scanRTEForColumn(pstate, rte, funcname, location, NULL); } /* diff --git a/src/backend/parser/parse_relation.c b/src/backend/parser/parse_relation.c index 478584d..e6adee1 100644 --- a/src/backend/parser/parse_relation.c +++ b/src/backend/parser/parse_relation.c @@ -15,6 +15,7 @@ #include postgres.h #include ctype.h +#include limits.h #include access/htup_details.h #include access/sysattr.h @@ -520,6 +521,69 @@ GetCTEForRTE(ParseState *pstate, RangeTblEntry *rte, int
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Sat, Dec 20, 2014 at 3:17 PM, Peter Geoghegan p...@heroku.com wrote: Attached patch implements this scheme. I had another thought: NAMEDATALEN + 1 is a better representation of infinity for matching purposes than INT_MAX. I probably should have made that change, too. It would then not have been necessary to #include limits.h. I think that this is a useful belt-and-suspenders precaution against integer overflow. It almost certainly won't matter, since it's very unlikely that the best match within an RTE will end up being a dropped column, but we might as well do it that way (Levenshtein distance is costed in multiples of code point changes, but the maximum density is 1 byte per codepoint). -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
* Tom Lane (t...@sss.pgh.pa.us) wrote: So my two cents is that when considering a qualified name, this patch should take levenshtein distance across the two components equally. There's no good reason to suppose that typos will attack one name component more (nor less) than the other. Agreed (since it seems like folks are curious for the opinion's of mostly bystanders). +1 to the above for my part. Thanks, Stephen signature.asc Description: Digital signature
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Sun, Dec 14, 2014 at 8:24 PM, Michael Paquier michael.paqu...@gmail.com wrote: On Tue, Dec 9, 2014 at 2:52 AM, Peter Geoghegan p...@heroku.com wrote: On Mon, Dec 8, 2014 at 9:43 AM, Peter Geoghegan p...@heroku.com wrote: I think it's very possible that the wrong alias may be provided by the user, and that we should consider that when providing a hint. Note that the existing mechanism (the mechanism that I'm trying to improve) only ever shows this error message: There is a column named \%s\ in table \%s\, but it cannot be referenced from this part of the query. I think it's pretty clear that this general class of user error is common. Moving this patch to CF 2014-12 as work is still going on, note that it is currently marked with Robert as reviewer and that its current status is Needs review. The status here is more like waiting around to see if anyone else has an opinion. The issue is what should happen when you enter qualified name like alvaro.herrera and there is no column named anything like herrara in the RTE named alvaro, but there is some OTHER RTE that contains a column with a name that is only a small Levenshtein distance away from herrera, like roberto.correra. The questions are: 1. Should we EVER give a you-might-have-meant hint in a case like this? 2. If so, does it matter whether the RTE name is just a bit different from the actual RTE or whether it's completely different? In other words, might we skip the hint in the above case but give one for alvara.correra? My current feeling is that we should answer #1 no, but Peter prefers to answer it yes. My further feeling is that if we do decide to say yes to #1, then I would answer #2 as yes also, but Peter would answer it no, assigning a fixed penalty for a mismatched RTE rather than one that varies by the Levenshtein distance between the RTEs. If no one else expresses an opinion, I'm going to insist on doing it my way, but I'm happy to have other people weigh in. Thanks, -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
Robert Haas robertmh...@gmail.com writes: On Sun, Dec 14, 2014 at 8:24 PM, Michael Paquier michael.paqu...@gmail.com wrote: Moving this patch to CF 2014-12 as work is still going on, note that it is currently marked with Robert as reviewer and that its current status is Needs review. The status here is more like waiting around to see if anyone else has an opinion. The issue is what should happen when you enter qualified name like alvaro.herrera and there is no column named anything like herrara in the RTE named alvaro, but there is some OTHER RTE that contains a column with a name that is only a small Levenshtein distance away from herrera, like roberto.correra. The questions are: 1. Should we EVER give a you-might-have-meant hint in a case like this? 2. If so, does it matter whether the RTE name is just a bit different from the actual RTE or whether it's completely different? In other words, might we skip the hint in the above case but give one for alvara.correra? It would be astonishingly silly to not care about the RTE name's distance, if you ask me. This is supposed to detect typos, not thinkos. I think there might be some value in a separate heuristic that, when you typed foo.bar and that doesn't match but there is a baz.bar, suggests that maybe you meant baz.bar, even if baz is not close typo-wise. This would be addressing the thinko case not the typo case, so the rules ought to be quite different --- in particular I doubt that it'd be a good idea to hint this way if the column names don't match exactly. But in any case the key point is that this is a different heuristic addressing a different failure mode. We should not try to make the levenshtein-distance heuristic address that case. So my two cents is that when considering a qualified name, this patch should take levenshtein distance across the two components equally. There's no good reason to suppose that typos will attack one name component more (nor less) than the other. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Tue, Dec 9, 2014 at 2:52 AM, Peter Geoghegan p...@heroku.com wrote: On Mon, Dec 8, 2014 at 9:43 AM, Peter Geoghegan p...@heroku.com wrote: I think it's very possible that the wrong alias may be provided by the user, and that we should consider that when providing a hint. Note that the existing mechanism (the mechanism that I'm trying to improve) only ever shows this error message: There is a column named \%s\ in table \%s\, but it cannot be referenced from this part of the query. I think it's pretty clear that this general class of user error is common. Moving this patch to CF 2014-12 as work is still going on, note that it is currently marked with Robert as reviewer and that its current status is Needs review. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Fri, Dec 5, 2014 at 3:45 PM, Peter Geoghegan p...@heroku.com wrote: On Fri, Dec 5, 2014 at 12:33 PM, Robert Haas robertmh...@gmail.com wrote: Well, if an alias is used, and you refer to an attribute using a non-alias name (i.e. the original table name), then you'll already get an error suggesting that the alias be used instead -- of course, that's nothing new. It doesn't matter to the existing hinting mechanism if the attribute name is otherwise wrong. Once you fix the code to use the alias suggested, you'll then get this new Levenshtein-based hint. In that case, I think I favor giving no hint at all when the RTE name is specified but doesn't match exactly. I don't follow. The existing mechanism only concerns what to do when the original table name was used when an alias should have been used instead. What does that have to do with this patch? Just that that's the case in which it seems useful to give a hint. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Mon, Dec 8, 2014 at 9:31 AM, Robert Haas robertmh...@gmail.com wrote: Just that that's the case in which it seems useful to give a hint. I think it's very possible that the wrong alias may be provided by the user, and that we should consider that when providing a hint. Besides, considering every visible RTE (while penalizing non-exact alias names iff the user provided an alias name) is actually going to make bad hints less likely, by increasing the number of equidistant low quality matches in a way that swamps the mechanism into providing no actual match at all. That's an important additional protection against low quality matches. What do other people think here? -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Mon, Dec 8, 2014 at 9:43 AM, Peter Geoghegan p...@heroku.com wrote: I think it's very possible that the wrong alias may be provided by the user, and that we should consider that when providing a hint. Note that the existing mechanism (the mechanism that I'm trying to improve) only ever shows this error message: There is a column named \%s\ in table \%s\, but it cannot be referenced from this part of the query. I think it's pretty clear that this general class of user error is common. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Wed, Dec 3, 2014 at 9:21 PM, Peter Geoghegan p...@heroku.com wrote: On Tue, Dec 2, 2014 at 1:11 PM, Robert Haas robertmh...@gmail.com wrote: Basically, the case in which I think it's helpful to issue a suggestion here is when the user has used the table name rather than the alias name. I wonder if it's worth checking for that case specifically, in lieu of what you've done here, and issuing a totally different hint in that case (HINT: You must refer to this as column as prime_minister.id rather than cameron.id). Well, if an alias is used, and you refer to an attribute using a non-alias name (i.e. the original table name), then you'll already get an error suggesting that the alias be used instead -- of course, that's nothing new. It doesn't matter to the existing hinting mechanism if the attribute name is otherwise wrong. Once you fix the code to use the alias suggested, you'll then get this new Levenshtein-based hint. In that case, I think I favor giving no hint at all when the RTE name is specified but doesn't match exactly. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Fri, Dec 5, 2014 at 12:33 PM, Robert Haas robertmh...@gmail.com wrote: Well, if an alias is used, and you refer to an attribute using a non-alias name (i.e. the original table name), then you'll already get an error suggesting that the alias be used instead -- of course, that's nothing new. It doesn't matter to the existing hinting mechanism if the attribute name is otherwise wrong. Once you fix the code to use the alias suggested, you'll then get this new Levenshtein-based hint. In that case, I think I favor giving no hint at all when the RTE name is specified but doesn't match exactly. I don't follow. The existing mechanism only concerns what to do when the original table name was used when an alias should have been used instead. What does that have to do with this patch? -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Tue, Dec 2, 2014 at 1:11 PM, Robert Haas robertmh...@gmail.com wrote: Basically, the case in which I think it's helpful to issue a suggestion here is when the user has used the table name rather than the alias name. I wonder if it's worth checking for that case specifically, in lieu of what you've done here, and issuing a totally different hint in that case (HINT: You must refer to this as column as prime_minister.id rather than cameron.id). Well, if an alias is used, and you refer to an attribute using a non-alias name (i.e. the original table name), then you'll already get an error suggesting that the alias be used instead -- of course, that's nothing new. It doesn't matter to the existing hinting mechanism if the attribute name is otherwise wrong. Once you fix the code to use the alias suggested, you'll then get this new Levenshtein-based hint. Another idea, which I think I like less well, is to check the Levenshtein distance between the allowed alias and the entered alias and, if that's within the half-the-shorter-length threshold, consider possible matches from that RTE, charge the distance between the correct alias and the entered alias as a penalty to each potential column match. I don't about that either. Aliases are often totally arbitrary, particularly for ad-hoc queries, which is what this is aimed at. What I think won't do is to look at a situation where the user has entered automobile.id and suggest that maybe they meant student.iq, or even student.id. I'm not sure I follow. If there is an automobile.ip, then it will be suggested. If there is no automobile column that's much of a match (so no automobile.ip, say), then student.id will be suggested (and not student.iq, *even if there is no student.id* - the final quality check saves us). So this is possible: postgres=# select iq, * from student, automobile; ERROR: 42703: column iq does not exist LINE 1: select iq, * from student, automobile; ^ HINT: Perhaps you meant to reference the column student.id. postgres=# select automobile.iq, * from student, automobile; ERROR: 42703: column automobile.iq does not exist LINE 1: select automobile.iq, * from student, automobile; ^ (note that using the table name makes us *not* see a suggestion where we otherwise would). The point is that there is a fixed penalty for a wrong user-specified alias, but all relation RTEs are considered. The amount of difference between the names has got to matter for the RTE names, just as it does for the column names. I think it makes sense that it matters by a fixed amount. Besides, this seems complicated enough already - I don't won't to add more complexity to worry about equidistant (but still actually valid) RTE/table/alias names. It sounds like your concern here is mostly a concern about the relative distance among multiple matches, as opposed to the absolute quality of suggestions. The former seems a lot less controversial than the latter was, though - the user always gets the best match, or the join pair of best matches, or no match when this new hinting mechanism is involved. I attach a new revision. The revision: * Uses default costs for Levenshtein distance. * Still charges extra for a non-alias-matching match (although it only charges a fixed distance of 1 extra). This has regression test coverage. * Applies a generic final quality check that enforces a requirement that a hint have a distance of no greater than 50% of the total string size. No special treatment of shorter strings is involved anymore. * Moves almost everything out of scanRTEForColumn() as you outlined (into a new function, updateFuzzyAttrMatchState(), per your suggestion). * Moves dropped column detection into updateFuzzyAttrMatchState(), per your suggestion. * Still does the if (rte-rtekind == RTE_JOIN) thing in the existing function searchRangeTableForCol(). I am quite confident that a suggestion from a join RTE will never be useful, to either the existing use of searchRangeTableForCol() or this expanded use, and it makes more sense to me to put it there. In fact, the existing use of searchRangeTableForCol() is really rather similar to this, and will give up on the first identical match (which is taken as evidence that there is a attribute of that name, but isn't visible at this level of the query). So I have not followed your suggestion here. Thoughts? -- Peter Geoghegan From 81c7b0691e9d03c1bdd99f4b264737306d1bd2cf Mon Sep 17 00:00:00 2001 From: Peter Geoghegan p...@heroku.com Date: Wed, 12 Nov 2014 15:31:37 -0800 Subject: [PATCH] Levenshtein distance column HINT Add a new HINT -- a guess as to what column the user might have intended to reference, to be shown in various contexts where an ERRCODE_UNDEFINED_COLUMN error is raised. The user will see this HINT when he or she fat-fingers a column reference in an ad-hoc SQL query, or incorrectly pluralizes or fails to pluralize a column reference, or incorrectly omits or
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Tue, Nov 25, 2014 at 7:13 PM, Peter Geoghegan p...@heroku.com wrote: Alright, so let me summarize what I think are the next steps in working towards getting this patch committed. I should produce a new revision which: * Uses default costings. * Applies a generic final quality check that enforces a distance of no greater than 50% of the total string size. (The use of default costings removes any reason to continue to do this) * Work through Robert's suggestions on other aspects that need work [1], most of which I already agreed to. Sounds good so far. What is unclear is whether or not I should continue to charge extra for non-matching user supplied alias (and, I think more broadly, consider multiple RTEs iff the user did use an alias) - Robert was skeptical, but didn't seem to have made his mind up. I still think I should cost things based on aliases, and consider multiple RTEs even when the user supplied an alias (the penalty should just be a distance of 1 and not 3, though, in light of other changes to the weighing/costing). If I don't hear anything in the next day or two, I'll more or less preserve aliases-related aspects of the patch. Basically, the case in which I think it's helpful to issue a suggestion here is when the user has used the table name rather than the alias name. I wonder if it's worth checking for that case specifically, in lieu of what you've done here, and issuing a totally different hint in that case (HINT: You must refer to this as column as prime_minister.id rather than cameron.id). Another idea, which I think I like less well, is to check the Levenshtein distance between the allowed alias and the entered alias and, if that's within the half-the-shorter-length threshold, consider possible matches from that RTE, charge the distance between the correct alias and the entered alias as a penalty to each potential column match. What I think won't do is to look at a situation where the user has entered automobile.id and suggest that maybe they meant student.iq, or even student.id. The amount of difference between the names has got to matter for the RTE names, just as it does for the column names. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Tue, Nov 25, 2014 at 4:13 PM, Peter Geoghegan p...@heroku.com wrote: If I don't hear anything in the next day or two, I'll more or less preserve aliases-related aspects of the patch. FYI, I didn't go ahead and work on this, because I thought that the thanksgiving holiday in the US probably kept you from giving feedback. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Wed, Nov 19, 2014 at 2:33 PM, Peter Geoghegan p...@heroku.com wrote: I don't think that's the case. Other RTEs are penalized for having non-matching aliases here. The point I made did not, as far as I can see, have anything to do with non-matching aliases; it could arise with just a single RTE. In general, I think the cost of a bad suggestion is much lower than the benefit of a good one. You seem to be suggesting that they're equal. Or that they're equally likely in an organic situation. In my estimation, this is not the case at all. The way I see it, the main cost of a bad suggestion is that it annoys the user with clutter which they may brand as stupid. Think about how much vitriol has been spewed over the years against progress bars (or estimated completion) times that don't turn out to mirror reality. Microsoft has gotten more cumulative flack about their inaccurate progress bars over the years than they would have for dropping an elevator on a cute baby. Or think about how annoying it is when a spell-checker or grammar-checker underlines something you've written that is, in your own opinion, correctly spelled or grammatical. Maybe that kind of thing doesn't annoy you, but it definitely annoys me, and I think probably a lot of other people. My general experience is that people get quite pissed off by bad suggestions from a computer. At least in my experience, users' actual level of agitation is often all out of proportion to what might seem justified, but we are designing this software for actual users, so their likely emotional reactions are relevant. I'm curious about your thoughts on the compromise of a ramped up distance threshold to apply a test for the absolute quality of a match. I think that the fact that git gives bad suggestions with terse strings tells us a lot, though. Note that unlike git, with terse strings we may well have a good deal more equidistant matches, and as soon as the number of would-be matches exceeds 2, we actually give no matches at all. So that's an additional protection against poor matches with terse strings. I don't know what you mean by a ramped-up distance threshold, exactly. I think it's good for the distance threshold to be lower for small strings and higher for large ones. I think I'm somewhat open to negotiation on the details, but I think any system that's going to suggest quantity for tit is going too far. If the user types qty when they meant quantity, they probably don't really need the hint, because they're going to say to themselves wait, I guess I didn't abbreviate that. The time when they need the hint is when they typed quanttiy, because it's quite possible to read a query with that sort of typo multiple times and not realize that you've made one. You're sitting there puzzling over where the quantity column went, and asking yourselves how you can be mis-remembering the schema, and saying wait, didn't I just see that column in the \d output ... and you don't even think to check carefully for a spelling mistake. The hint may well clue you in to what the real problem is. In other words, I think there's value in trying to clue somebody in when they've made a typo, but not when they've made a think-o. We won't be able to do the latter accurately enough to make it more useful than annoying. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
Robert Haas robertmh...@gmail.com writes: ... In other words, I think there's value in trying to clue somebody in when they've made a typo, but not when they've made a think-o. We won't be able to do the latter accurately enough to make it more useful than annoying. FWIW, I concur with Robert's analysis that wrong suggestions are likely to be annoying. We should be erring on the side of not making a suggestion rather than making one that's a low-probability guess. I'm not particularly convinced that the f1 - f2 example is a useful behavior, and I'm downright horrified by the qty - quantity case. If the hint mechanism thinks the latter two are close enough together to suggest, it's going to be spewing a whole lot of utterly ridiculous suggestions. I'm going to be annoyed way more times than I'm going to be helped. The big picture is that this is more or less our first venture into heuristic suggestions. I think we should start slow with a very conservative set of heuristics. If it's a success maybe we can get more aggressive over time --- but if we go over the top here, the entire concept will be discredited in this community for the next ten years. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Thu, Nov 20, 2014 at 8:05 AM, Tom Lane t...@sss.pgh.pa.us wrote: I'm not particularly convinced that the f1 - f2 example is a useful behavior, and I'm downright horrified by the qty - quantity case. If the hint mechanism thinks the latter two are close enough together to suggest, it's going to be spewing a whole lot of utterly ridiculous suggestions. I'm going to be annoyed way more times than I'm going to be helped. I happen to think that that isn't the case, because the number of possible suggestions is fairly low anyway, and people don't tend to make those kind of errors. Robert's examples of ridiculous suggestions of quantity based on three letter strings other than qty (e.g. tit) were rather contrived. In fact, most 3 letter strings will not offer a suggestion. 3 or more Equidistant would-be matches tend to offer a lot of additional protection against bad suggestions for these terse strings. The big picture is that this is more or less our first venture into heuristic suggestions. I think we should start slow with a very conservative set of heuristics. If it's a success maybe we can get more aggressive over time --- but if we go over the top here, the entire concept will be discredited in this community for the next ten years. I certainly see your point here. It's not as if we have an *evolved* understanding of the usability issues. Besides, as Robert pointed out, most of the value of this patch is added by simple cases, like a failure to pluralize or not pluralize, or the omission of an underscore. I still think we should charge half for deletion, but I will concede that it's prudent to apply a more restrictive absolute quality final test. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Thu, Nov 20, 2014 at 7:32 AM, Robert Haas robertmh...@gmail.com wrote: In general, I think the cost of a bad suggestion is much lower than the benefit of a good one. You seem to be suggesting that they're equal. Or that they're equally likely in an organic situation. In my estimation, this is not the case at all. The way I see it, the main cost of a bad suggestion is that it annoys the user with clutter which they may brand as stupid. Think about how much vitriol has been spewed over the years against progress bars (or estimated completion) times that don't turn out to mirror reality. Well, you can judge the quality of the suggestion immediately. I imagined a mechanism that gives a little bit more than the minimum amount of guidance for things like contractions/abbreviations. Microsoft has gotten more cumulative flack about their inaccurate progress bars over the years than they would have for dropping an elevator on a cute baby. I haven't used a more recent version of Windows than Windows Vista, but I'm pretty sure that they kept it up. I'm curious about your thoughts on the compromise of a ramped up distance threshold to apply a test for the absolute quality of a match. I think that the fact that git gives bad suggestions with terse strings tells us a lot, though. Note that unlike git, with terse strings we may well have a good deal more equidistant matches, and as soon as the number of would-be matches exceeds 2, we actually give no matches at all. So that's an additional protection against poor matches with terse strings. I don't know what you mean by a ramped-up distance threshold, exactly. I think it's good for the distance threshold to be lower for small strings and higher for large ones. I think I'm somewhat open to negotiation on the details, but I think any system that's going to suggest quantity for tit is going too far. I mean the suggestion of raising the cost threshold more gradually, not as a step function of the number of characters in the string [1] where it's either over 6 characters and must pass the 50% test, or isn't and has no absolute quality test. The exact modification I described will FWIW remove the quantity for qty suggestion, as well as all the similar suggestions that you found objectionable (like tit also offering a suggestion of quantity). If you look at the regression tests, none of the sensible suggestions are lost (some would be by an across the board 50% absolute quality threshold, as I previously pointed out [2]), but all the bad ones are. I attach failed regression test output showing the difference between the previous expected values, and actual values with that small modification - it looks like most or all bad cases are now fixed. If the user types qty when they meant quantity, they probably don't really need the hint, because they're going to say to themselves wait, I guess I didn't abbreviate that. The time when they need the hint is when they typed quanttiy, because it's quite possible to read a query with that sort of typo multiple times and not realize that you've made one. I agree that that's a more important case. In other words, I think there's value in trying to clue somebody in when they've made a typo, but not when they've made a think-o. We won't be able to do the latter accurately enough to make it more useful than annoying. That's certainly true; I think that we only disagree about the exact point at which we enter the think-o correction business. [1] http://www.postgresql.org/message-id/CAM3SWZT+7hH29Go6ZuY2OrCS40=6ypvm_nt9njfovp3xwji...@mail.gmail.com [2] http://www.postgresql.org/message-id/cam3swztsgoknht8rk+0eed7spnjg4padmbqqyi0fh9bwcnv...@mail.gmail.com -- Peter Geoghegan regression.diffs Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Thu, Nov 20, 2014 at 1:30 PM, Peter Geoghegan p...@heroku.com wrote: I mean the suggestion of raising the cost threshold more gradually, not as a step function of the number of characters in the string [1] where it's either over 6 characters and must pass the 50% test, or isn't and has no absolute quality test. The exact modification I described will FWIW remove the quantity for qty suggestion, as well as all the similar suggestions that you found objectionable (like tit also offering a suggestion of quantity). If you look at the regression tests, none of the sensible suggestions are lost (some would be by an across the board 50% absolute quality threshold, as I previously pointed out [2]), but all the bad ones are. I attach failed regression test output showing the difference between the previous expected values, and actual values with that small modification - it looks like most or all bad cases are now fixed. That does seem to give better results, but it still seems awfully complicated. If we just used Levenshtein with all-default cost factors and a distance cap equal to Max(strlen(what_user_typed), strlen(candidate_match), 3), what cases that you think are important would be harmed? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Thu, Nov 20, 2014 at 11:03 AM, Robert Haas robertmh...@gmail.com wrote: That does seem to give better results, but it still seems awfully complicated. If we just used Levenshtein with all-default cost factors and a distance cap equal to Max(strlen(what_user_typed), strlen(candidate_match), 3), what cases that you think are important would be harmed? Well, just by plugging in default Levenshtein cost factors, I can see the following regression: *** /home/pg/postgresql/src/test/regress/expected/join.out 2014-11-20 10:17:55.042291912 -0800 --- /home/pg/postgresql/src/test/regress/results/join.out 2014-11-20 11:42:15.670108745 -0800 *** *** 3452,3458 ERROR: column atts.relid does not exist LINE 1: select atts.relid::regclass, s.* from pg_stats s join ^ - HINT: Perhaps you meant to reference the column atts.indexrelid. Within the catalogs, the names of attributes are prefixed as a form of what you might call internal namespacing. For example, pg_index has attributes that all begin with ind*. You could easily omit something like that, while still more or less knowing what you're looking for. In more concrete terms, this gets no suggestion: postgres=# select key from pg_index; ERROR: 42703: column key does not exist LINE 1: select key from pg_index; ^ Only this does: postgres=# select ikey from pg_index; ERROR: 42703: column ikey does not exist LINE 1: select ikey from pg_index; ^ HINT: Perhaps you meant to reference the column pg_index.indkey. postgres=# The git people varied their Levenshtein costings for a reason. I also think that a one size fits all cap will break things. It will independently break the example above, as well as the more marginal c1.f2. vs c1.f2 case (okay, maybe that case was exactly on the threshold, but others won't be). I don't see that different costings actually saves any complexity. Similarly, the final cap is quite straightforward. Anything with any real complexity happens before that. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Thu, Nov 20, 2014 at 3:00 PM, Peter Geoghegan p...@heroku.com wrote: On Thu, Nov 20, 2014 at 11:03 AM, Robert Haas robertmh...@gmail.com wrote: That does seem to give better results, but it still seems awfully complicated. If we just used Levenshtein with all-default cost factors and a distance cap equal to Max(strlen(what_user_typed), strlen(candidate_match), 3), what cases that you think are important would be harmed? Well, just by plugging in default Levenshtein cost factors, I can see the following regression: *** /home/pg/postgresql/src/test/regress/expected/join.out 2014-11-20 10:17:55.042291912 -0800 --- /home/pg/postgresql/src/test/regress/results/join.out 2014-11-20 11:42:15.670108745 -0800 *** *** 3452,3458 ERROR: column atts.relid does not exist LINE 1: select atts.relid::regclass, s.* from pg_stats s join ^ - HINT: Perhaps you meant to reference the column atts.indexrelid. Within the catalogs, the names of attributes are prefixed as a form of what you might call internal namespacing. For example, pg_index has attributes that all begin with ind*. You could easily omit something like that, while still more or less knowing what you're looking for. In more concrete terms, this gets no suggestion: postgres=# select key from pg_index; ERROR: 42703: column key does not exist LINE 1: select key from pg_index; ^ Only this does: postgres=# select ikey from pg_index; ERROR: 42703: column ikey does not exist LINE 1: select ikey from pg_index; ^ HINT: Perhaps you meant to reference the column pg_index.indkey. postgres=# Seems fine to me. If you typed relid rather than indexrelid or key rather than indkey, that's a thinko, not a typo. ikey for indkey could plausible be a typo, though you'd have to be having a fairly bad day at the keyboard. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Thu, Nov 20, 2014 at 12:14 PM, Robert Haas robertmh...@gmail.com wrote: Seems fine to me. If you typed relid rather than indexrelid or key rather than indkey, that's a thinko, not a typo. ikey for indkey could plausible be a typo, though you'd have to be having a fairly bad day at the keyboard. I can tell that I have no chance of convincing you otherwise. While I think you're mistaken to go against the precedent set by git, you're the one with the commit bit, and I think we've already spent enough time discussing this. So default costings it is. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On 2014-11-20 12:00:51 -0800, Peter Geoghegan wrote: In more concrete terms, this gets no suggestion: postgres=# select key from pg_index; ERROR: 42703: column key does not exist LINE 1: select key from pg_index; ^ I don't think that's a bad thing. Yes, for a human those look pretty similar, but it's easy to construct cases where that gives completely hilarious results. I think something simplistic like levenshtein, even with modified distances, is good to catch typos. But not to find terms that are related in more complex ways. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
Peter Geoghegan p...@heroku.com writes: On Thu, Nov 20, 2014 at 11:03 AM, Robert Haas robertmh...@gmail.com wrote: That does seem to give better results, but it still seems awfully complicated. If we just used Levenshtein with all-default cost factors and a distance cap equal to Max(strlen(what_user_typed), strlen(candidate_match), 3), what cases that you think are important would be harmed? Well, just by plugging in default Levenshtein cost factors, I can see the following regression: *** /home/pg/postgresql/src/test/regress/expected/join.out 2014-11-20 10:17:55.042291912 -0800 --- /home/pg/postgresql/src/test/regress/results/join.out 2014-11-20 11:42:15.670108745 -0800 *** *** 3452,3458 ERROR: column atts.relid does not exist LINE 1: select atts.relid::regclass, s.* from pg_stats s join ^ - HINT: Perhaps you meant to reference the column atts.indexrelid. I do not have a problem with deciding that that is not a regression; in fact, not giving that hint seems like a good conservative behavior here. By your logic, we should also be prepared to suggest supercalifragilisticexpialidocious when the user enters ocious. It's simply a bridge too far for what is supposed to be a hint for minor typos. You sound like you want to turn it into something that will look up column names for people who are too lazy to even try to type the right thing. While I can see the value of such a tool within certain contexts, firing completed queries at a live SQL engine is not one of them. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Thu, Nov 20, 2014 at 12:26 PM, Tom Lane t...@sss.pgh.pa.us wrote: I do not have a problem with deciding that that is not a regression; in fact, not giving that hint seems like a good conservative behavior here. By your logic, we should also be prepared to suggest supercalifragilisticexpialidocious when the user enters ocious. That's clearly not true. I just want to be a bit more forgiving of omissions. That clearly isn't my logic, since that isn't a suggestion that the implementation will give, or would be anywhere close to giving - my weighing of deletions is only twice that of substitutions or insertions, not ten times. git does not use Levenshtein default costings either. It's simply a bridge too far for what is supposed to be a hint for minor typos. Minor typos and minor omissions. My example was on the edge of what would be tolerable under my proposed cost model. You sound like you want to turn it into something that will look up column names for people who are too lazy to even try to type the right thing. While I can see the value of such a tool within certain contexts, firing completed queries at a live SQL engine is not one of them. It's just a hint; a convenience. Users who imagine that it takes away the need for putting any thought into their SQL queries have bigger problems. Anyway, that's all that needs to be said on that, since I've already given up on a non-default costing. Also, we have default costing, and we always apply a 50% standard (I see no point in doing otherwise with default costings). Robert: Where does that leave us? What about suggestions across RTEs? Alias costing, etc? -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Thu, Nov 20, 2014 at 3:20 PM, Peter Geoghegan p...@heroku.com wrote: On Thu, Nov 20, 2014 at 12:14 PM, Robert Haas robertmh...@gmail.com wrote: Seems fine to me. If you typed relid rather than indexrelid or key rather than indkey, that's a thinko, not a typo. ikey for indkey could plausible be a typo, though you'd have to be having a fairly bad day at the keyboard. I can tell that I have no chance of convincing you otherwise. While I think you're mistaken to go against the precedent set by git, you're the one with the commit bit, and I think we've already spent enough time discussing this. So default costings it is. I've got a few +1s, too, if you notice. I'm willing to be outvoted, but not by a majority of one. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Thu, Nov 20, 2014 at 12:50 PM, Robert Haas robertmh...@gmail.com wrote: I've got a few +1s, too, if you notice. Then maybe I spoke too soon. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Tue, Nov 18, 2014 at 8:03 PM, Tom Lane t...@sss.pgh.pa.us wrote: Peter Geoghegan p...@heroku.com writes: On Tue, Nov 18, 2014 at 3:29 PM, Robert Haas robertmh...@gmail.com wrote: On Mon, Nov 17, 2014 at 3:04 PM, Peter Geoghegan p...@heroku.com wrote: postgres=# select qty from orderlines ; ERROR: 42703: column qty does not exist HINT: Perhaps you meant to reference the column orderlines.quantity. I don't buy this example, because it would give you the same hint if you told it you wanted to access a column called ant, or uay, or tit. And that's clearly ridiculous. The reason why quantity looks like a reasonable suggestion for qty is because it's a conventional abbreviation, but an extremely high percentage of comparable cases won't be. I maintain that omission of part of the correct spelling should be weighed less. I would say that omission of the first letter should completely disqualify suggestions based on this heuristic; but it might make sense to weight omissions less after the first letter. I think we would be well-advised not to start inventing our own approximate matching algorithm. Peter's suggestion boils down to a guess that the default cost parameters for Levenshtein suck, and your suggestion boils down to a guess that we can fix the problems with Peter's suggestion by bolting another heuristic on top of it - and possibly running Levenshtein twice with different sets of cost parameters. Ugh. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Wed, Nov 19, 2014 at 5:43 AM, Robert Haas robertmh...@gmail.com wrote: I think we would be well-advised not to start inventing our own approximate matching algorithm. Peter's suggestion boils down to a guess that the default cost parameters for Levenshtein suck, and your suggestion boils down to a guess that we can fix the problems with Peter's suggestion by bolting another heuristic on top of it - and possibly running Levenshtein twice with different sets of cost parameters. Ugh. I agree. While I am perfectly comfortable with the fact that we are guessing here, my guesses are based on what I observed to work well with real schemas, and simulated errors that I thought were representative of human error. Obviously it's possible that another scheme will do better sometimes, including for example a scheme that picks a match entirely at random. But on average, I think that what I have here will do better than anything else proposed so far. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Wed, Nov 19, 2014 at 12:33 PM, Peter Geoghegan p...@heroku.com wrote: On Wed, Nov 19, 2014 at 5:43 AM, Robert Haas robertmh...@gmail.com wrote: I think we would be well-advised not to start inventing our own approximate matching algorithm. Peter's suggestion boils down to a guess that the default cost parameters for Levenshtein suck, and your suggestion boils down to a guess that we can fix the problems with Peter's suggestion by bolting another heuristic on top of it - and possibly running Levenshtein twice with different sets of cost parameters. Ugh. I agree. While I am perfectly comfortable with the fact that we are guessing here, my guesses are based on what I observed to work well with real schemas, and simulated errors that I thought were representative of human error. Obviously it's possible that another scheme will do better sometimes, including for example a scheme that picks a match entirely at random. But on average, I think that what I have here will do better than anything else proposed so far. If you agree, then I'm not being clear enough. I don't think think that tinkering with the Levenshtein cost factors is a good idea, and I think it's unhelpful to suggest something when the suggestion and the original word differ by more than 50% of the number characters in the shorter word. Suggesting col for oid or x for xmax, as crops up in the regression tests with this patch applied, shows the folly of this: the user didn't mean the other named column; rather, the user was confused about whether a particular system column existed for that table. If we had a large database of examples showing what the user typed and what they intended, we could try different algorithms against it and see which one performs best with fewest false positives. But if we don't have that, we should do things that are like the things that other people have done before. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Wed, Nov 19, 2014 at 9:52 AM, Robert Haas robertmh...@gmail.com wrote: If you agree, then I'm not being clear enough. I don't think think that tinkering with the Levenshtein cost factors is a good idea, and I think it's unhelpful to suggest something when the suggestion and the original word differ by more than 50% of the number characters in the shorter word. I agree - except for very short strings, where there is insufficient information to go on. I was talking about the difficulty of bolting something on top of Levenshtein distance that looked at the first character. That would not need two costings, but would require an encoding aware matching of the first code point. Suggesting col for oid or x for xmax, as crops up in the regression tests with this patch applied, shows the folly of this: the user didn't mean the other named column; rather, the user was confused about whether a particular system column existed for that table. Those are all very terse strings. What you're overlooking is what is broken by using straight Levenshtein distance, which includes things in the regression test that are reasonable and helpful. As I mentioned before, requiring a greater than 50% of total string size distance breaks this, just within the regression tests: ERROR: column f1 does not exist LINE 1: select f1 from c1; ^ - HINT: Perhaps you meant to reference the column c1.f2. And: ERROR: column atts.relid does not exist LINE 1: select atts.relid::regclass, s.* from pg_stats s join ^ - HINT: Perhaps you meant to reference the column atts.indexrelid. Those are really useful suggestions! And, they're much more representative of real user error. The downside of weighing deletion less than substitution and insertion is much smaller than the upside. It's worth it. The downside is only that the user gets to see the best suggestion that isn't all that good in an absolute sense (which we have a much harder time concluding using simple tests for short misspellings). If we had a large database of examples showing what the user typed and what they intended, we could try different algorithms against it and see which one performs best with fewest false positives. But if we don't have that, we should do things that are like the things that other people have done before. That seems totally impractical. No one has that kind of data that I'm aware of. How about git as a kind of precedent? It is not at all conservative about showing *some* suggestion: $ git aa git: 'aa' is not a git command. See 'git --help'. Did you mean this? am $ git d git: 'd' is not a git command. See 'git --help'. Did you mean one of these? diff add $ git ddd git: 'ddd' is not a git command. See 'git --help'. Did you mean this? add And why wouldn't git be? As far as its concerned, you can only have meant one of those small number of things. Similarly, with the patch, the number of things we can pick from is fairly limited at this stage, since we are actually fairly far along with parse analysis. Now, this won't give a suggestion: $ git git: '' is not a git command. See 'git --help'. So it looks like git similarly weighs deletion less than insertion/substitution. Are its suggestions any less ridiculous than your examples of questionable hints from the modified regression test expected output? This is just a guidance mechanism, and at worst we'll show the best match (on the mechanisms own terms, which isn't too bad). -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Wed, Nov 19, 2014 at 10:22 AM, Peter Geoghegan p...@heroku.com wrote: Those are all very terse strings. What you're overlooking is what is broken by using straight Levenshtein distance, which includes things in the regression test that are reasonable and helpful. As I mentioned before, requiring a greater than 50% of total string size distance breaks this, just within the regression tests: Maybe you'd prefer if there was a more gradual ramp-up to requiring a distance of no greater than 50% of the string size (normalized to take account of my non-default costings). Right now it's a step function of the number of characters in the string - there is no absolute quality requirement for strings of 6 or fewer requirements. Otherwise, there is the 50% distance absolute quality test (the test that you want to be applied generally). I think that would be better, without being much more complicated. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Wed, Nov 19, 2014 at 10:33 AM, Peter Geoghegan p...@heroku.com wrote: there is no absolute quality requirement for strings of 6 or fewer requirements. I meant 6 or fewer *characters*, obviously. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Wed, Nov 19, 2014 at 10:33 AM, Peter Geoghegan p...@heroku.com wrote: Maybe you'd prefer if there was a more gradual ramp-up to requiring a distance of no greater than 50% of the string size (normalized to take account of my non-default costings) I made this modification: diff --git a/src/backend/parser/parse_relation.c b/src/backend/parser/parse_relation.c index 40c69d7..cca075f 100644 --- a/src/backend/parser/parse_relation.c +++ b/src/backend/parser/parse_relation.c @@ -929,7 +929,8 @@ searchRangeTableForCol(ParseState *pstate, const char *alias, char *colname, * seen when 6 deletions are required against actual attribute name, or 3 * insertions/substitutions. */ - if (state-distance 6 state-distance strlen(colname) / 2) + if ((state-distance 3 state-distance strlen(colname)) || + (state-distance 6 state-distance strlen(colname) / 2)) { state-rsecond = state-rfirst = NULL; state-second = state-first = InvalidAttrNumber; When I run the regression tests now, then all the cases that you found objectionable in the regression tests' previous expected output disappear, while all the cases I think are useful that were previously removed by applying a broad 50% standard remain. While I'm not 100% sure that this exact formulation is the best one, I think that we can reach a compromise on this point, that allows the costing to remain the same without offering particularly bad suggestions for short strings. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Wed, Nov 19, 2014 at 1:22 PM, Peter Geoghegan p...@heroku.com wrote: On Wed, Nov 19, 2014 at 9:52 AM, Robert Haas robertmh...@gmail.com wrote: If you agree, then I'm not being clear enough. I don't think think that tinkering with the Levenshtein cost factors is a good idea, and I think it's unhelpful to suggest something when the suggestion and the original word differ by more than 50% of the number characters in the shorter word. I agree - except for very short strings, where there is insufficient information to go on. That's precisely the time I think it's *most* important. In a very long string, the threshold should be LESS than 50%. My original proposal was no more than 3 characters of difference, but in any event not more than half the length of the shorter string. Suggesting col for oid or x for xmax, as crops up in the regression tests with this patch applied, shows the folly of this: the user didn't mean the other named column; rather, the user was confused about whether a particular system column existed for that table. Those are all very terse strings. What you're overlooking is what is broken by using straight Levenshtein distance, which includes things in the regression test that are reasonable and helpful. As I mentioned before, requiring a greater than 50% of total string size distance breaks this, just within the regression tests: ERROR: column f1 does not exist LINE 1: select f1 from c1; ^ - HINT: Perhaps you meant to reference the column c1.f2. That's exactly 50%, not more than 50%. (I'm also on the fence about whether the hint is actually helpful in that case, but the rule I proposed wouldn't prohibit it.) And: ERROR: column atts.relid does not exist LINE 1: select atts.relid::regclass, s.* from pg_stats s join ^ - HINT: Perhaps you meant to reference the column atts.indexrelid. Those are really useful suggestions! And, they're much more representative of real user error. That one's right at 50% too, but it's certainly more than 3 characters of difference. I think it's going to be pretty hard to emit a suggestion in that case but not in a whole lot of cases that don't make any sense. How about git as a kind of precedent? It is not at all conservative about showing *some* suggestion: $ git aa git: 'aa' is not a git command. See 'git --help'. Did you mean this? am $ git d git: 'd' is not a git command. See 'git --help'. Did you mean one of these? diff add $ git ddd git: 'ddd' is not a git command. See 'git --help'. Did you mean this? add And why wouldn't git be? As far as its concerned, you can only have meant one of those small number of things. Similarly, with the patch, the number of things we can pick from is fairly limited at this stage, since we are actually fairly far along with parse analysis. I went and found the actual code git uses for this. It's here: https://github.com/git/git/blob/d29e9c89dbbf0876145dc88615b99308cab5f187/help.c And the underlying Levenshtein implementation is here: https://github.com/git/git/blob/398dd4bd039680ba98497fbedffa415a43583c16/levenshtein.c Apparently what they're doing is charging 0 for a transposition (which we don't have as a separate concept), 2 for a substitution, 1 for an insertion, and 3 for a deletion, with the constraint that anything with a total distance of more than 6 isn't considered. And that does overall seem to give pretty good suggestions. However, an interesting point about the git algorithm is that it's not hard to make it do stupid things on short strings: [rhaas pgsql]$ git xy git: 'xy' is not a git command. See 'git --help'. Did you mean one of these? am gc mv rm Maybe they should adopt my idea of a lower cutoff for short strings. :-) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Wed, Nov 19, 2014 at 11:13 AM, Robert Haas robertmh...@gmail.com wrote: Apparently what they're doing is charging 0 for a transposition (which we don't have as a separate concept), 2 for a substitution, 1 for an insertion, and 3 for a deletion, with the constraint that anything with a total distance of more than 6 isn't considered. And that does overall seem to give pretty good suggestions. The git people know that no git command is longer than 4 or 5 characters. That doesn't apply to us. I certainly would not like to have an absolute distance test of n characters. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Wed, Nov 19, 2014 at 11:13 AM, Robert Haas robertmh...@gmail.com wrote: That's precisely the time I think it's *most* important. In a very long string, the threshold should be LESS than 50%. My original proposal was no more than 3 characters of difference, but in any event not more than half the length of the shorter string. We can only hint based on the information given by the user. If they give a lot of badly matching information, we have something to go on. That one's right at 50% too, but it's certainly more than 3 characters of difference. I think it's going to be pretty hard to emit a suggestion in that case but not in a whole lot of cases that don't make any sense. I don't think that's the case. Other RTEs are penalized for having non-matching aliases here. In general, I think the cost of a bad suggestion is much lower than the benefit of a good one. You seem to be suggesting that they're equal. Or that they're equally likely in an organic situation. In my estimation, this is not the case at all. I'm curious about your thoughts on the compromise of a ramped up distance threshold to apply a test for the absolute quality of a match. I think that the fact that git gives bad suggestions with terse strings tells us a lot, though. Note that unlike git, with terse strings we may well have a good deal more equidistant matches, and as soon as the number of would-be matches exceeds 2, we actually give no matches at all. So that's an additional protection against poor matches with terse strings. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
Robert Haas wrote: And the underlying Levenshtein implementation is here: https://github.com/git/git/blob/398dd4bd039680ba98497fbedffa415a43583c16/levenshtein.c Apparently what they're doing is charging 0 for a transposition (which we don't have as a separate concept), 2 for a substitution, 1 for an insertion, and 3 for a deletion, with the constraint that anything with a total distance of more than 6 isn't considered. 0 for a transposition, wow. I suggested adding transpositions but there was no support for that idea. I suggested it because I thikn it's the most common form of typo, and charging 2 for a deletion plus 1 for an insertion makes a single transposition mistaek count as 3, which seems wrong -- particularly seeing the git precedent. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Wed, Nov 19, 2014 at 11:34 AM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: 0 for a transposition, wow. Again, they're optimizing for short strings (git commands) only. There just isn't that many transposition errors possible with a 4 character string. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
Peter Geoghegan wrote: On Wed, Nov 19, 2014 at 11:34 AM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: 0 for a transposition, wow. Again, they're optimizing for short strings (git commands) only. There just isn't that many transposition errors possible with a 4 character string. If there's logic in your statement, I can't see it. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Wed, Nov 19, 2014 at 11:54 AM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Peter Geoghegan wrote: On Wed, Nov 19, 2014 at 11:34 AM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: 0 for a transposition, wow. Again, they're optimizing for short strings (git commands) only. There just isn't that many transposition errors possible with a 4 character string. If there's logic in your statement, I can't see it. The point is that transposition errors should not have no cost. If git did not have an absolute quality test of a distance of 6, which they can only have because all git commands are terse, then you could construct a counter example. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
Peter Geoghegan wrote: On Wed, Nov 19, 2014 at 11:54 AM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Peter Geoghegan wrote: On Wed, Nov 19, 2014 at 11:34 AM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: 0 for a transposition, wow. Again, they're optimizing for short strings (git commands) only. There just isn't that many transposition errors possible with a 4 character string. If there's logic in your statement, I can't see it. The point is that transposition errors should not have no cost. If git did not have an absolute quality test of a distance of 6, which they can only have because all git commands are terse, then you could construct a counter example. Okay. My point is just that whatever the string length, I think we'd do well to regard transpositions as cheap in terms of error cost. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Mon, Nov 17, 2014 at 3:04 PM, Peter Geoghegan p...@heroku.com wrote: postgres=# select qty from orderlines ; ERROR: 42703: column qty does not exist LINE 1: select qty from orderlines ; ^ HINT: Perhaps you meant to reference the column orderlines.quantity. I don't buy this example, because it would give you the same hint if you told it you wanted to access a column called ant, or uay, or tit. And that's clearly ridiculous. The reason why quantity looks like a reasonable suggestion for qty is because it's a conventional abbreviation, but an extremely high percentage of comparable cases won't be. +/* + * Charge extra (for inexact matches only) when an alias was + * specified that differs from what might have been used to + * correctly qualify this RTE's closest column + */ +if (wrongalias) +rtestate.distance += 3; I don't understand what situation this is catering to. Can you explain? It seems to account for a good deal of complexity. Two cases: 1. Distinguishing between the case where there was an exact match to a column that isn't visible (i.e. the existing reason for errorMissingColumn() to call here), and the case where there is a visible column, but our alias was the wrong one. I guess that could live in errorMissingColumn(), but overall it's more convenient to do it here, so that errorMissingColumn() handles things almost uniformly and doesn't really have to care. 2. For non-exact (fuzzy) matches, it seems more useful to give one match rather than two when the user gave an alias that matches one particular RTE. Consider this: postgres=# select ordersid from orders o join orderlines ol on o.orderid = ol.orderid; ERROR: 42703: column ordersid does not exist LINE 1: select ordersid from orders o join orderlines ol on o.orderi... ^ HINT: Perhaps you meant to reference the column o.orderid or the column ol.orderid. LOCATION: errorMissingColumn, parse_relation.c:3166 postgres=# select ol.ordersid from orders o join orderlines ol on o.orderid = ol.orderid; ERROR: 42703: column ol.ordersid does not exist LINE 1: select ol.ordersid from orders o join orderlines ol on o.ord... ^ HINT: Perhaps you meant to reference the column ol.orderid. LOCATION: errorMissingColumn, parse_relation.c:3147 I guess I'm confused at a broader level. If the alias is wrong, why are we considering names in this RTE *at all*? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Tue, Nov 18, 2014 at 3:29 PM, Robert Haas robertmh...@gmail.com wrote: On Mon, Nov 17, 2014 at 3:04 PM, Peter Geoghegan p...@heroku.com wrote: postgres=# select qty from orderlines ; ERROR: 42703: column qty does not exist LINE 1: select qty from orderlines ; ^ HINT: Perhaps you meant to reference the column orderlines.quantity. I don't buy this example, because it would give you the same hint if you told it you wanted to access a column called ant, or uay, or tit. And that's clearly ridiculous. The reason why quantity looks like a reasonable suggestion for qty is because it's a conventional abbreviation, but an extremely high percentage of comparable cases won't be. Is that so terrible? Yes, if those *exact* strings are tried, that'll happen. But the vast majority of 3 letter strings will not do that (including many 3 letter strings that include one of the letters 'q', 't' and 'y', such as qqq, ttt, and yyy). Why, in practice, would someone even attempt those strings? I'm worried about Murphy, not Machiavelli. That seems like a pretty important distinction here. I maintain that omission of part of the correct spelling should be weighed less. I am optimizing for the case where the user has a rough idea of the structure and spelling of things - if they're typing in random strings, or totally distinct synonyms, there is little we can do about that. As I said, there will always be the most marginal case that still gets a suggestion. I see no reason to hurt the common case where we help in order to save the user from seeing a ridiculous suggestion. I have a final test for the absolute quality of a suggestion, but I think we could easily be too conservative about that. At worst, our ridiculous suggestion makes apparent that the user's incorrect spelling was itself ridiculous. With larger strings, we can afford to be more conservative, and we are, because we have more information to go on. Terse column names are not uncommon, though. +/* + * Charge extra (for inexact matches only) when an alias was + * specified that differs from what might have been used to + * correctly qualify this RTE's closest column + */ +if (wrongalias) +rtestate.distance += 3; I don't understand what situation this is catering to. Can you explain? It seems to account for a good deal of complexity. I guess I'm confused at a broader level. If the alias is wrong, why are we considering names in this RTE *at all*? Because it's a common mistake when writing ad-hoc queries. People may forget which exact table their column comes from. We certainly want to weigh the fact that an alias was specified, but it shouldn't totally limit our guess to that RTE. If nothing else, the fact that there was a much closer match from another RTE ought to result in forcing there to be no suggestion (due to there being too many equally good suggestions). That's because, as I said, an *absolute* test for the quality of a match is problematic (which, again, is why I err on the side of letting the final, absolute quality test not limit suggestions, particularly with short strings). -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
Peter Geoghegan p...@heroku.com writes: On Tue, Nov 18, 2014 at 3:29 PM, Robert Haas robertmh...@gmail.com wrote: On Mon, Nov 17, 2014 at 3:04 PM, Peter Geoghegan p...@heroku.com wrote: postgres=# select qty from orderlines ; ERROR: 42703: column qty does not exist HINT: Perhaps you meant to reference the column orderlines.quantity. I don't buy this example, because it would give you the same hint if you told it you wanted to access a column called ant, or uay, or tit. And that's clearly ridiculous. The reason why quantity looks like a reasonable suggestion for qty is because it's a conventional abbreviation, but an extremely high percentage of comparable cases won't be. I maintain that omission of part of the correct spelling should be weighed less. I would say that omission of the first letter should completely disqualify suggestions based on this heuristic; but it might make sense to weight omissions less after the first letter. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Sat, Nov 15, 2014 at 7:36 PM, Peter Geoghegan p...@heroku.com wrote: Attached patch incorporates this feedback. The only user-visible difference between this revision and the previous revision is that it's quite possible for two suggestion to originate from the same RTE (there is exactly one change in the regression test's expected output as compared to the last revision for this reason. The regression tests are otherwise unchanged). It's still not possible to see more than 2 suggestions under any circumstances, no matter where they might have originated from, which I think is appropriate -- we continue to not present any HINT in the event of 3 or more equidistant matches. I think that the restructuring required to pass around a state variable has resulted in somewhat clearer code. Cool! I'm grumpy about the distanceName() function. That seems too generic. If we're going to keep this as it is, I suggest something like computeRTEColumnDistance(). But see below. On a related note, I'm also grumpy about this comment: +/* Charge half as much per deletion as per insertion or per substitution */ +return varstr_levenshtein_less_equal(actual, len, match, match_len, + 2, 1, 2, max); The purpose of a code comment is to articulate WHY we did something, rather than simply to restate what the code quite obviously does. I haven't heard a compelling argument for why this should be 2, 1, 2 rather than the default 1, 1, 1; and I'm inclined to do the latter unless you can make some very good argument for this combination of weights. And if you can make such an argument, then there should be comments so that the next person to come along and look at this code doesn't go, huh, that's whacky, and change it. + int location, FuzzyAttrMatchState *rtestate) I suggest calling this fuzzystate rather than rtestate; it's not the state of the RTE, but the state of the fuzzy matching. Within the scanRTEForColumn block, we have a rather large chunk of code protected by if (rtestate), which contains the only call to distanceName(). I suggest that we move all of this logic to a separate, static function, and merge distanceName into it. I also suggest testing against NULL explicitly instead of implicitly. So this block of code would end up as something like: if (fuzzystate != NULL) updateFuzzyAttrMatchState(rte, attcolname, colname, fuzzystate); In searchRangeTableForCol, I'm fairly certain that you've changed the behavior by adding a check for if (rte-rtekind == RTE_JOIN) before the call to scanRTEForColumn(). Why not instead put this check into updateFuzzyAttrMatchState? Then you can be sure you're not changing the behavior in any other case. On a similar note, I think the dropped-column test should happen early as well, probably again in updateFuzzyAttrMatchState(). There's little point in adding a suggestion only to throw it away again. +/* + * Charge extra (for inexact matches only) when an alias was + * specified that differs from what might have been used to + * correctly qualify this RTE's closest column + */ +if (wrongalias) +rtestate.distance += 3; I don't understand what situation this is catering to. Can you explain? It seems to account for a good deal of complexity. ERROR: column oid does not exist LINE 1: select oid 0, * from altwithoid; ^ +HINT: Perhaps you meant to reference the column altwithoid.col. That seems like a stretch. I think I suggested before using a distance threshold of at most 3 or half the word length, whichever is less. For a three-letter column name that means not suggesting anything if more than one character is different. What you implemented here is close to that, yet somehow we've got a suggestion slipping through that has 2 out of 3 characters different. I'm not quite sure I see how that's getting through, but I think it shouldn't. ERROR: column fullname.text does not exist LINE 1: select fullname.text from fullname; ^ +HINT: Perhaps you meant to reference the column fullname.last. Same problem, only worse! They've only got one letter of four in common. ERROR: column xmin does not exist LINE 1: select xmin, * from fooview; ^ +HINT: Perhaps you meant to reference the column fooview.x. Basically the same problem again. I think the distance threshold in this case should be half the shorter column name, i.e. 0. Your new test cases include no negative test cases; that is, cases where the machinery declines to suggest a hint because of, say, 3 equally good possibilities. They probably should have something like that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription:
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Mon, Nov 17, 2014 at 10:15 AM, Robert Haas robertmh...@gmail.com wrote: I'm grumpy about the distanceName() function. That seems too generic. If we're going to keep this as it is, I suggest something like computeRTEColumnDistance(). But see below. Fair point. On a related note, I'm also grumpy about this comment: +/* Charge half as much per deletion as per insertion or per substitution */ +return varstr_levenshtein_less_equal(actual, len, match, match_len, + 2, 1, 2, max); The purpose of a code comment is to articulate WHY we did something, rather than simply to restate what the code quite obviously does. I haven't heard a compelling argument for why this should be 2, 1, 2 rather than the default 1, 1, 1; and I'm inclined to do the latter unless you can make some very good argument for this combination of weights. And if you can make such an argument, then there should be comments so that the next person to come along and look at this code doesn't go, huh, that's whacky, and change it. Okay. I agree that that deserves a comment. The actual argument for this formulation is that it just seems to work better that way. For example: postgres=# \d orderlines Table public.orderlines Column| Type | Modifiers -+--+--- orderlineid | integer | not null orderid | integer | not null prod_id | integer | not null quantity| smallint | not null orderdate | date | not null postgres=# select qty from orderlines ; ERROR: 42703: column qty does not exist LINE 1: select qty from orderlines ; ^ HINT: Perhaps you meant to reference the column orderlines.quantity. The point is that the fact that the user supplied qty string has so many fewer characters than what was obviously intended - quantity - deserves to be weighed less. If you change the costing to weigh character deletion as being equal to substitution/addition, this example breaks. I also think it's pretty common to have noise words in every attribute (e.g. every column in the orderlines table matches orderlines_*), which might otherwise mess things up by overcharging for deletion. Having extra characters in the correctly spelled column name seems legitimately less significant to me. Or, in other words: having actual characters from the misspelling match the correct spelling (and having actual characters given not fail to match) is most important. What was given by the user is more important than what was not given but should have been, which is not generally true for uses of Levenshtein distance. I reached this conclusion through trying out the patch with a couple of real schemas, and seeing what works best. It's hard to express that idea tersely, in a comment, but I guess I'll try. + int location, FuzzyAttrMatchState *rtestate) I suggest calling this fuzzystate rather than rtestate; it's not the state of the RTE, but the state of the fuzzy matching. The idea here was to differentiate this state from the overall range table state (in general, FuzzyAttrMatchState may be one or the other). But okay. Within the scanRTEForColumn block, we have a rather large chunk of code protected by if (rtestate), which contains the only call to distanceName(). I suggest that we move all of this logic to a separate, static function, and merge distanceName into it. I also suggest testing against NULL explicitly instead of implicitly. So this block of code would end up as something like: if (fuzzystate != NULL) updateFuzzyAttrMatchState(rte, attcolname, colname, fuzzystate); Okay. In searchRangeTableForCol, I'm fairly certain that you've changed the behavior by adding a check for if (rte-rtekind == RTE_JOIN) before the call to scanRTEForColumn(). Why not instead put this check into updateFuzzyAttrMatchState? Then you can be sure you're not changing the behavior in any other case. I thought that I had avoided changing things (beyond what was advertised as changed in relation to this most recent revision) because I also changed things WRT multiple matches per RTE. It's fuzzy. Anyway, yeah, I could do it there instead. On a similar note, I think the dropped-column test should happen early as well, probably again in updateFuzzyAttrMatchState(). There's little point in adding a suggestion only to throw it away again. Agreed. +/* + * Charge extra (for inexact matches only) when an alias was + * specified that differs from what might have been used to + * correctly qualify this RTE's closest column + */ +if (wrongalias) +rtestate.distance += 3; I don't understand what situation this is catering to. Can you explain? It seems to account for a good deal of complexity. Two cases: 1. Distinguishing between the case where there was an exact match to a column that isn't visible
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Wed, Nov 12, 2014 at 12:59 PM, Robert Haas robertmh...@gmail.com wrote: On that topic, I think there's unanimous consensus against the design where equally-distant matches are treated differently based on whether they are in the same RTE or different RTEs. I think you need to change that if you want to get anywhere with this. On a related note, the use of the additional parameter AttrNumber closest[2] to searchRangeTableForCol() and of the additional parameters AttrNumber *matchedatt and int *distance to scanRTEForColumn() is less than self-documenting. I suggest creating a structure called something like FuzzyAttrMatchState and passing a pointer to it down to both functions. Attached patch incorporates this feedback. The only user-visible difference between this revision and the previous revision is that it's quite possible for two suggestion to originate from the same RTE (there is exactly one change in the regression test's expected output as compared to the last revision for this reason. The regression tests are otherwise unchanged). It's still not possible to see more than 2 suggestions under any circumstances, no matter where they might have originated from, which I think is appropriate -- we continue to not present any HINT in the event of 3 or more equidistant matches. I think that the restructuring required to pass around a state variable has resulted in somewhat clearer code. -- Peter Geoghegan From 0aef5253f10ebb1ee5bbcc73782eff1352c7ab84 Mon Sep 17 00:00:00 2001 From: Peter Geoghegan p...@heroku.com Date: Wed, 12 Nov 2014 15:31:37 -0800 Subject: [PATCH] Levenshtein distance column HINT Add a new HINT -- a guess as to what column the user might have intended to reference, to be shown in various contexts where an ERRCODE_UNDEFINED_COLUMN error is raised. The user will see this HINT when he or she fat-fingers a column reference in an ad-hoc SQL query, or incorrectly pluralizes or fails to pluralize a column reference, or incorrectly omits or includes an underscore or other punctuation character. The HINT suggests a column in the range table with the lowest Levenshtein distance, or the tied-for-best pair of matching columns in the event of there being exactly two equally likely candidates (these may come from multiple RTEs, or the same RTE). Limiting to two the number of cases where multiple equally likely suggestions are all offered at once (i.e. giving no hint when the number of equally likely candidates exceeds two) is a measure against suggestions that are of low quality in an absolute sense. A further, final measure is taken against suggestions that are of low absolute quality: If the distance exceeds a normalized distance threshold, no suggestion is given. --- src/backend/parser/parse_expr.c | 9 +- src/backend/parser/parse_func.c | 2 +- src/backend/parser/parse_relation.c | 345 +++--- src/backend/utils/adt/levenshtein.c | 9 + src/include/parser/parse_relation.h | 20 +- src/test/regress/expected/alter_table.out | 8 + src/test/regress/expected/join.out| 39 src/test/regress/expected/plpgsql.out | 1 + src/test/regress/expected/rowtypes.out| 1 + src/test/regress/expected/rules.out | 1 + src/test/regress/expected/without_oid.out | 2 + src/test/regress/sql/join.sql | 24 +++ 12 files changed, 421 insertions(+), 40 deletions(-) diff --git a/src/backend/parser/parse_expr.c b/src/backend/parser/parse_expr.c index 4a8aaf6..a77a3a0 100644 --- a/src/backend/parser/parse_expr.c +++ b/src/backend/parser/parse_expr.c @@ -621,7 +621,8 @@ transformColumnRef(ParseState *pstate, ColumnRef *cref) colname = strVal(field2); /* Try to identify as a column of the RTE */ -node = scanRTEForColumn(pstate, rte, colname, cref-location); +node = scanRTEForColumn(pstate, rte, colname, cref-location, + NULL); if (node == NULL) { /* Try it as a function call on the whole row */ @@ -666,7 +667,8 @@ transformColumnRef(ParseState *pstate, ColumnRef *cref) colname = strVal(field3); /* Try to identify as a column of the RTE */ -node = scanRTEForColumn(pstate, rte, colname, cref-location); +node = scanRTEForColumn(pstate, rte, colname, cref-location, + NULL); if (node == NULL) { /* Try it as a function call on the whole row */ @@ -724,7 +726,8 @@ transformColumnRef(ParseState *pstate, ColumnRef *cref) colname = strVal(field4); /* Try to identify as a column of the RTE */ -node = scanRTEForColumn(pstate, rte, colname, cref-location); +node = scanRTEForColumn(pstate, rte, colname, cref-location, + NULL); if (node == NULL) { /* Try it as a function call on the whole row */ diff --git a/src/backend/parser/parse_func.c b/src/backend/parser/parse_func.c index 9ebd3fd..472e15e 100644 --- a/src/backend/parser/parse_func.c +++
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Wed, Nov 12, 2014 at 10:42 PM, Michael Paquier michael.paqu...@gmail.com wrote: On Tue, Nov 11, 2014 at 3:52 PM, Peter Geoghegan p...@heroku.com wrote: I'm pretty puzzled by this. Other than our agree to disagree and defer to committer position on the question of whether or not more than one suggestion can come from a single RTE, which you were fine with before [1], I have only restored the core/contrib separation to a state recently suggested by Robert as the best and simplest all around [2]. Did I miss something else? My point is: I am not sure I can be defined as a reviewer of this patch or take any credit in this patch review knowing that the latest version submitted is a simple rebase of the version I did my first review on. Hence, code speaking, this patch is in the same state as when it has been firstly submitted. Of course you can. Time spent reviewing is time spent reviewing, whether it results in changes to the patch or not. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Thu, Nov 13, 2014 at 8:57 AM, Robert Haas robertmh...@gmail.com wrote: My point is: I am not sure I can be defined as a reviewer of this patch or take any credit in this patch review knowing that the latest version submitted is a simple rebase of the version I did my first review on. Hence, code speaking, this patch is in the same state as when it has been firstly submitted. Of course you can. Time spent reviewing is time spent reviewing, whether it results in changes to the patch or not. My thoughts exactly. I thought Michael did a good job, even if I didn't agree with everything he said. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Wed, Nov 12, 2014 at 8:00 PM, Peter Geoghegan p...@heroku.com wrote: On Wed, Nov 12, 2014 at 4:54 PM, Peter Geoghegan p...@heroku.com wrote: Attached patch moves the Levenshtein distance implementation into core. Oops. Somehow managed to send a *.patch.swp file. :-) Here is the actual patch. Committed. I changed varstr_leven() to varstr_levenshtein() because abbrvs cn mk the code hrd to undstnd. And to grep. And I removed the StaticAssertStmt you added, because it's not actually used for anything that necessitates that, yet. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Thu, Nov 13, 2014 at 9:36 AM, Robert Haas robertmh...@gmail.com wrote: Committed. I changed varstr_leven() to varstr_levenshtein() because abbrvs cn mk the code hrd to undstnd. And to grep. Thanks. I'll produce a revision of patch 2/2 soon. And I removed the StaticAssertStmt you added, because it's not actually used for anything that necessitates that, yet. I'll add it back in in patch 2/2, so. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Sun, Nov 9, 2014 at 11:48 PM, Peter Geoghegan p...@heroku.com wrote: On Fri, Nov 7, 2014 at 12:57 PM, Robert Haas robertmh...@gmail.com wrote: Based on this review from a month ago, I'm going to mark this Waiting on Author. If nobody updates the patch in a few days, I'll mark it Returned with Feedback. Thanks. Attached revision fixes the compiler warning that Heikki complained about. I maintain SQL-callable stub functions from within contrib, rather than follow Michael's approach. In other words, very little has changed from my revision from July last [1]. I agree with your proposed approach to moving Levenshtein into core. However, I think this should be separated into two patches, one of them moving the Levenshtein functionality into core, and the other adding the new treatment for missing column errors. If you can do that relatively soon, I'll make an effort to get the refactoring patch committed in the near future. Once that's done, we can focus in on the interesting part of the patch, which is the actual machinery for suggesting alternatives. On that topic, I think there's unanimous consensus against the design where equally-distant matches are treated differently based on whether they are in the same RTE or different RTEs. I think you need to change that if you want to get anywhere with this. On a related note, the use of the additional parameter AttrNumber closest[2] to searchRangeTableForCol() and of the additional parameters AttrNumber *matchedatt and int *distance to scanRTEForColumn() is less than self-documenting. I suggest creating a structure called something like FuzzyAttrMatchState and passing a pointer to it down to both functions. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Wed, Nov 12, 2014 at 12:59 PM, Robert Haas robertmh...@gmail.com wrote: I agree with your proposed approach to moving Levenshtein into core. However, I think this should be separated into two patches, one of them moving the Levenshtein functionality into core, and the other adding the new treatment for missing column errors. If you can do that relatively soon, I'll make an effort to get the refactoring patch committed in the near future. Once that's done, we can focus in on the interesting part of the patch, which is the actual machinery for suggesting alternatives. Okay, thanks. I think I can do that fairly soon. On that topic, I think there's unanimous consensus against the design where equally-distant matches are treated differently based on whether they are in the same RTE or different RTEs. I think you need to change that if you want to get anywhere with this. Alright. It wasn't as if I felt very strongly about it either way. On a related note, the use of the additional parameter AttrNumber closest[2] to searchRangeTableForCol() and of the additional parameters AttrNumber *matchedatt and int *distance to scanRTEForColumn() is less than self-documenting. I suggest creating a structure called something like FuzzyAttrMatchState and passing a pointer to it down to both functions. Sure. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Wed, Nov 12, 2014 at 1:13 PM, Peter Geoghegan p...@heroku.com wrote: On Wed, Nov 12, 2014 at 12:59 PM, Robert Haas robertmh...@gmail.com wrote: I agree with your proposed approach to moving Levenshtein into core. However, I think this should be separated into two patches, one of them moving the Levenshtein functionality into core, and the other adding the new treatment for missing column errors. If you can do that relatively soon, I'll make an effort to get the refactoring patch committed in the near future. Once that's done, we can focus in on the interesting part of the patch, which is the actual machinery for suggesting alternatives. Okay, thanks. I think I can do that fairly soon. Attached patch moves the Levenshtein distance implementation into core. You're missing patch 2 of 2 here, because I have yet to incorporate your feedback on the HINT itself -- when I've done that, I'll post a newly rebased patch 2/2, with those items taken care of. As you pointed out, there is no reason to wait for that. -- Peter Geoghegan .0001-Move-Levenshtein-distance-implementation-into-core.patch.swp Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Wed, Nov 12, 2014 at 4:54 PM, Peter Geoghegan p...@heroku.com wrote: Attached patch moves the Levenshtein distance implementation into core. Oops. Somehow managed to send a *.patch.swp file. :-) Here is the actual patch. -- Peter Geoghegan From b7df918f1a52107637600f3b22d1cff18bd07ae1 Mon Sep 17 00:00:00 2001 From: Peter Geoghegan p...@heroku.com Date: Sat, 30 Nov 2013 23:15:00 -0800 Subject: [PATCH 1/2] Move Levenshtein distance implementation into core The fuzzystmatch Levenshtein distance implementation is moved from /contrib to core. However, the related SQL-callable functions may still only be used with the fuzzystmatch extension installed -- the fuzzystmatch definitions become simple forwarding stubs. It is not anticipated that the user-facing SQL functions will be moved into core in the future, due to the MAX_LEVENSHTEIN_STRLEN restriction. An in-core Levenshtein distance implementation is only anticipated to be helpful in building diagnostic messages. --- contrib/fuzzystrmatch/Makefile| 3 - contrib/fuzzystrmatch/fuzzystrmatch.c | 81 --- contrib/fuzzystrmatch/levenshtein.c | 403 -- src/backend/utils/adt/Makefile| 2 + src/backend/utils/adt/levenshtein.c | 394 + src/backend/utils/adt/varlena.c | 24 ++ src/include/utils/builtins.h | 5 + 7 files changed, 481 insertions(+), 431 deletions(-) delete mode 100644 contrib/fuzzystrmatch/levenshtein.c create mode 100644 src/backend/utils/adt/levenshtein.c diff --git a/contrib/fuzzystrmatch/Makefile b/contrib/fuzzystrmatch/Makefile index 024265d..0327d95 100644 --- a/contrib/fuzzystrmatch/Makefile +++ b/contrib/fuzzystrmatch/Makefile @@ -17,6 +17,3 @@ top_builddir = ../.. include $(top_builddir)/src/Makefile.global include $(top_srcdir)/contrib/contrib-global.mk endif - -# levenshtein.c is #included by fuzzystrmatch.c -fuzzystrmatch.o: fuzzystrmatch.c levenshtein.c diff --git a/contrib/fuzzystrmatch/fuzzystrmatch.c b/contrib/fuzzystrmatch/fuzzystrmatch.c index 7a53d8a..62e650f 100644 --- a/contrib/fuzzystrmatch/fuzzystrmatch.c +++ b/contrib/fuzzystrmatch/fuzzystrmatch.c @@ -154,23 +154,6 @@ getcode(char c) /* These prevent GH from becoming F */ #define NOGHTOF(c) (getcode(c) 16) /* BDH */ -/* Faster than memcmp(), for this use case. */ -static inline bool -rest_of_char_same(const char *s1, const char *s2, int len) -{ - while (len 0) - { - len--; - if (s1[len] != s2[len]) - return false; - } - return true; -} - -#include levenshtein.c -#define LEVENSHTEIN_LESS_EQUAL -#include levenshtein.c - PG_FUNCTION_INFO_V1(levenshtein_with_costs); Datum levenshtein_with_costs(PG_FUNCTION_ARGS) @@ -180,8 +163,20 @@ levenshtein_with_costs(PG_FUNCTION_ARGS) int ins_c = PG_GETARG_INT32(2); int del_c = PG_GETARG_INT32(3); int sub_c = PG_GETARG_INT32(4); - - PG_RETURN_INT32(levenshtein_internal(src, dst, ins_c, del_c, sub_c)); + const char *s_data; + const char *t_data; + int s_bytes, +t_bytes; + + /* Extract a pointer to the actual character data */ + s_data = VARDATA_ANY(src); + t_data = VARDATA_ANY(dst); + /* Determine length of each string in bytes and characters */ + s_bytes = VARSIZE_ANY_EXHDR(src); + t_bytes = VARSIZE_ANY_EXHDR(dst); + + PG_RETURN_INT32(varstr_leven(s_data, s_bytes, t_data, t_bytes, ins_c, + del_c, sub_c)); } @@ -191,8 +186,20 @@ levenshtein(PG_FUNCTION_ARGS) { text *src = PG_GETARG_TEXT_PP(0); text *dst = PG_GETARG_TEXT_PP(1); - - PG_RETURN_INT32(levenshtein_internal(src, dst, 1, 1, 1)); + const char *s_data; + const char *t_data; + int s_bytes, +t_bytes; + + /* Extract a pointer to the actual character data */ + s_data = VARDATA_ANY(src); + t_data = VARDATA_ANY(dst); + /* Determine length of each string in bytes and characters */ + s_bytes = VARSIZE_ANY_EXHDR(src); + t_bytes = VARSIZE_ANY_EXHDR(dst); + + PG_RETURN_INT32(varstr_leven(s_data, s_bytes, t_data, t_bytes, 1, 1, + 1)); } @@ -206,8 +213,20 @@ levenshtein_less_equal_with_costs(PG_FUNCTION_ARGS) int del_c = PG_GETARG_INT32(3); int sub_c = PG_GETARG_INT32(4); int max_d = PG_GETARG_INT32(5); - - PG_RETURN_INT32(levenshtein_less_equal_internal(src, dst, ins_c, del_c, sub_c, max_d)); + const char *s_data; + const char *t_data; + int s_bytes, +t_bytes; + + /* Extract a pointer to the actual character data */ + s_data = VARDATA_ANY(src); + t_data = VARDATA_ANY(dst); + /* Determine length of each string in bytes and characters */ + s_bytes = VARSIZE_ANY_EXHDR(src); + t_bytes = VARSIZE_ANY_EXHDR(dst); + + PG_RETURN_INT32(varstr_leven_less_equal(s_data, s_bytes, t_data, t_bytes, + ins_c, del_c, sub_c, max_d)); } @@ -218,8 +237,20 @@ levenshtein_less_equal(PG_FUNCTION_ARGS) text *src = PG_GETARG_TEXT_PP(0); text *dst = PG_GETARG_TEXT_PP(1); int max_d = PG_GETARG_INT32(2); - - PG_RETURN_INT32(levenshtein_less_equal_internal(src, dst, 1, 1, 1, max_d));
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Mon, Nov 10, 2014 at 1:48 PM, Peter Geoghegan p...@heroku.com wrote: Reminder: I maintain a slight preference for only offering one suggestion per relation RTE, which is what this revision does (so no change there). If a committer who picks this up wants me to alter that, I don't mind doing so; since only Michael spoke up on this, I've kept things my way. Hm. The last version of this patch has not really changed since since my first review, and I have no more feedback to provide about it except what I already mentioned. I honestly don't think that this patch is ready for committer as-is... If someone wants to review it further, well extra opinions I am sure are welcome. -- Michael
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Mon, Nov 10, 2014 at 8:13 PM, Michael Paquier michael.paqu...@gmail.com wrote: Hm. The last version of this patch has not really changed since since my first review, and I have no more feedback to provide about it except what I already mentioned. I honestly don't think that this patch is ready for committer as-is... If someone wants to review it further, well extra opinions I am sure are welcome. Why not? You've already said that you're happy to defer to whatever committer picks this up with regard to whether or not more than a single suggestion can come from an RTE. I agreed with this (i.e. I said I'd defer to their opinion too), and once again drew particular attention to this state of affairs alongside my most recent revision. What does that leave? -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Mon, Nov 10, 2014 at 10:29 PM, Peter Geoghegan p...@heroku.com wrote: Why not? You've already said that you're happy to defer to whatever committer picks this up with regard to whether or not more than a single suggestion can come from an RTE. I agreed with this (i.e. I said I'd defer to their opinion too), and once again drew particular attention to this state of affairs alongside my most recent revision. What does that leave? I see you've marked this Needs Review, even though your previously marked it Ready for Committer a few months back (Robert marked it Waiting on Author very recently because of the compiler warning, and then I marked it back to Ready for Committer once that was addressed, before you finally marked it back to Needs Review and removed yourself as the reviewer just now). I'm pretty puzzled by this. Other than our agree to disagree and defer to committer position on the question of whether or not more than one suggestion can come from a single RTE, which you were fine with before [1], I have only restored the core/contrib separation to a state recently suggested by Robert as the best and simplest all around [2]. Did I miss something else? [1] http://www.postgresql.org/message-id/CAB7nPqQObEeQ298F0Rb5+vrgex5_r=j-bvqzgp0qa1y_xdc...@mail.gmail.com [2] http://www.postgresql.org/message-id/ca+tgmoykiiq8mc0uj5i5xfktybg1qqfn4yrckz60ydunumk...@mail.gmail.com -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Fri, Nov 7, 2014 at 12:57 PM, Robert Haas robertmh...@gmail.com wrote: Based on this review from a month ago, I'm going to mark this Waiting on Author. If nobody updates the patch in a few days, I'll mark it Returned with Feedback. Thanks. Attached revision fixes the compiler warning that Heikki complained about. I maintain SQL-callable stub functions from within contrib, rather than follow Michael's approach. In other words, very little has changed from my revision from July last [1]. Reminder: I maintain a slight preference for only offering one suggestion per relation RTE, which is what this revision does (so no change there). If a committer who picks this up wants me to alter that, I don't mind doing so; since only Michael spoke up on this, I've kept things my way. This is not a completion mechanism; it is supposed to work on *complete* column references with slight misspellings (e.g. incorrect use of plurals, or column references with an omitted underscore character). Weighing Tom's concerns about suggestions that are of absolute low quality is what makes me conclude that this is the thing to do. [1] http://www.postgresql.org/message-id/CAM3SWZTzQO=oy4jmfb-65iefie8ihukderk-0oljetm8dsr...@mail.gmail.com -- Peter Geoghegan From 830bf9f668972ba6b531df5d4fcbd73db3472434 Mon Sep 17 00:00:00 2001 From: Peter Geoghegan p...@heroku.com Date: Sat, 30 Nov 2013 23:15:00 -0800 Subject: [PATCH] Levenshtein distance column HINT Add a new HINT -- a guess as to what column the user might have intended to reference, to be shown in various contexts where an ERRCODE_UNDEFINED_COLUMN error is raised. The user will see this HINT when he or she fat-fingers a column reference in his or her ad-hoc SQL query, or incorrectly pluralizes or fails to pluralize a column reference. The HINT suggests a column in the range table with the lowest Levenshtein distance, or the tied-for-best pair of matching columns in the event of there being exactly two equally likely candidates (iff each candidate column comes from a separate RTE). Limiting the cases where multiple equally likely suggestions are all offered at once is a measure against suggestions that are of low quality in an absolute sense. A further, final measure is taken against suggestions that are of low absolute quality: If the distance exceeds a normalized distance threshold, no suggestion is given. The contrib Levenshtein distance implementation is moved from /contrib to core. However, the SQL-callable functions may only be used with the fuzzystmatch extension installed, just as before -- the fuzzystmatch definitions become mere forwarding stubs. --- contrib/fuzzystrmatch/Makefile| 3 - contrib/fuzzystrmatch/fuzzystrmatch.c | 81 -- contrib/fuzzystrmatch/levenshtein.c | 403 -- src/backend/parser/parse_expr.c | 9 +- src/backend/parser/parse_func.c | 2 +- src/backend/parser/parse_relation.c | 319 --- src/backend/utils/adt/Makefile| 2 + src/backend/utils/adt/levenshtein.c | 393 + src/backend/utils/adt/varlena.c | 25 ++ src/include/parser/parse_relation.h | 3 +- src/include/utils/builtins.h | 5 + src/test/regress/expected/alter_table.out | 8 + src/test/regress/expected/join.out| 39 +++ src/test/regress/expected/plpgsql.out | 1 + src/test/regress/expected/rowtypes.out| 1 + src/test/regress/expected/rules.out | 1 + src/test/regress/expected/without_oid.out | 1 + src/test/regress/sql/join.sql | 24 ++ 18 files changed, 849 insertions(+), 471 deletions(-) delete mode 100644 contrib/fuzzystrmatch/levenshtein.c create mode 100644 src/backend/utils/adt/levenshtein.c diff --git a/contrib/fuzzystrmatch/Makefile b/contrib/fuzzystrmatch/Makefile index 024265d..0327d95 100644 --- a/contrib/fuzzystrmatch/Makefile +++ b/contrib/fuzzystrmatch/Makefile @@ -17,6 +17,3 @@ top_builddir = ../.. include $(top_builddir)/src/Makefile.global include $(top_srcdir)/contrib/contrib-global.mk endif - -# levenshtein.c is #included by fuzzystrmatch.c -fuzzystrmatch.o: fuzzystrmatch.c levenshtein.c diff --git a/contrib/fuzzystrmatch/fuzzystrmatch.c b/contrib/fuzzystrmatch/fuzzystrmatch.c index 7a53d8a..62e650f 100644 --- a/contrib/fuzzystrmatch/fuzzystrmatch.c +++ b/contrib/fuzzystrmatch/fuzzystrmatch.c @@ -154,23 +154,6 @@ getcode(char c) /* These prevent GH from becoming F */ #define NOGHTOF(c) (getcode(c) 16) /* BDH */ -/* Faster than memcmp(), for this use case. */ -static inline bool -rest_of_char_same(const char *s1, const char *s2, int len) -{ - while (len 0) - { - len--; - if (s1[len] != s2[len]) - return false; - } - return true; -} - -#include levenshtein.c -#define LEVENSHTEIN_LESS_EQUAL -#include levenshtein.c - PG_FUNCTION_INFO_V1(levenshtein_with_costs); Datum levenshtein_with_costs(PG_FUNCTION_ARGS) @@
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Mon, Nov 10, 2014 at 1:48 PM, Peter Geoghegan p...@heroku.com wrote: On Fri, Nov 7, 2014 at 12:57 PM, Robert Haas robertmh...@gmail.com wrote: Based on this review from a month ago, I'm going to mark this Waiting on Author. If nobody updates the patch in a few days, I'll mark it Returned with Feedback. Thanks. Attached revision fixes the compiler warning that Heikki complained about. I maintain SQL-callable stub functions from within contrib, rather than follow Michael's approach. In other words, very little has changed from my revision from July last [1]. FWIW, I still find this bit of code that this patch adds in varlena.c ugly: +#include levenshtein.c +#define LEVENSHTEIN_LESS_EQUAL +#include levenshtein.c +#undef LEVENSHTEIN_LESS_EQUAL -- Michael
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Sun, Nov 9, 2014 at 8:56 PM, Michael Paquier michael.paqu...@gmail.com wrote: FWIW, I still find this bit of code that this patch adds in varlena.c ugly: +#include levenshtein.c +#define LEVENSHTEIN_LESS_EQUAL +#include levenshtein.c +#undef LEVENSHTEIN_LESS_EQUAL Okay, but this is the coding that currently appears within contrib's fuzzystrmatch.c, more or less unchanged. The #undef LEVENSHTEIN_LESS_EQUAL line that I added ought to be unnecessary. I'll give the final word on that to whoever picks this up. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Mon, Oct 6, 2014 at 3:09 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 07/18/2014 10:47 AM, Michael Paquier wrote: On Fri, Jul 18, 2014 at 3:54 AM, Peter Geoghegan p...@heroku.com wrote: I am not opposed to moving the contrib code into core in the manner that you oppose. I don't feel strongly either way. I noticed in passing that your revision says this *within* levenshtein.c: + * Guaranteed to work with Name datatype's cstrings. + * For full details see levenshtein.c. Yeah, I looked at what I produced yesterday night again and came across a couple of similar things :) And reworked a couple of things in the version attached, mainly wordsmithing and adding comments here and there, as well as making the naming of the Levenshtein functions in core the same as the ones in fuzzystrmatch 1.0. I imagined that when a committer picked this up, an executive decision would be made one way or the other. I am quite willing to revise the patch to alter this behavior at the request of a committer. Fine for me. I'll move this patch to the next stage then. There are a bunch of compiler warnings: parse_relation.c: In function ‘errorMissingColumn’: parse_relation.c:3114:447: warning: ‘closestcol1’ may be used uninitialized in this function [-Wmaybe-uninitialized] parse_relation.c:3066:8: note: ‘closestcol1’ was declared here parse_relation.c:3129:29: warning: ‘closestcol2’ may be used uninitialized in this function [-Wmaybe-uninitialized] parse_relation.c:3067:8: note: ‘closestcol2’ was declared here levenshtein.c: In function ‘levenshtein_common’: levenshtein.c:107:6: warning: unused variable ‘start_column_local’ [-Wunused-variable] Based on this review from a month ago, I'm going to mark this Waiting on Author. If nobody updates the patch in a few days, I'll mark it Returned with Feedback. Thanks. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On 07/18/2014 10:47 AM, Michael Paquier wrote: On Fri, Jul 18, 2014 at 3:54 AM, Peter Geoghegan p...@heroku.com wrote: I am not opposed to moving the contrib code into core in the manner that you oppose. I don't feel strongly either way. I noticed in passing that your revision says this *within* levenshtein.c: + * Guaranteed to work with Name datatype's cstrings. + * For full details see levenshtein.c. Yeah, I looked at what I produced yesterday night again and came across a couple of similar things :) And reworked a couple of things in the version attached, mainly wordsmithing and adding comments here and there, as well as making the naming of the Levenshtein functions in core the same as the ones in fuzzystrmatch 1.0. I imagined that when a committer picked this up, an executive decision would be made one way or the other. I am quite willing to revise the patch to alter this behavior at the request of a committer. Fine for me. I'll move this patch to the next stage then. There are a bunch of compiler warnings: parse_relation.c: In function ‘errorMissingColumn’: parse_relation.c:3114:447: warning: ‘closestcol1’ may be used uninitialized in this function [-Wmaybe-uninitialized] parse_relation.c:3066:8: note: ‘closestcol1’ was declared here parse_relation.c:3129:29: warning: ‘closestcol2’ may be used uninitialized in this function [-Wmaybe-uninitialized] parse_relation.c:3067:8: note: ‘closestcol2’ was declared here levenshtein.c: In function ‘levenshtein_common’: levenshtein.c:107:6: warning: unused variable ‘start_column_local’ [-Wunused-variable] - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Thu, Jul 17, 2014 at 9:34 AM, Michael Paquier michael.paqu...@gmail.com wrote: Patch 1 does a couple of things: - fuzzystrmatch is dumped to 1.1, as Levenshtein functions are not part of it anymore, and moved to core. - Removal of the LESS_EQUAL flag that made the original submission patch harder to understand. All the Levenshtein functions wrap a single common function. - Documentation is moved, and regression tests for Levenshtein functions are added. - Functions with costs are renamed with a suffix with costs. After hacking this feature, I came up with the conclusion that it would be better for the user experience to move directly into backend code all the Levenshtein functions, instead of only moving in the common wrapper as Peter did in his original patches. This is done this way to avoid keeping portions of the same feature in two different places of the code (backend with common routine, fuzzystrmatch with levenshtein functions) and concentrate all the logic in a single place. Now, we may as well consider renaming the levenshtein functions into smarter names, like str_distance, and keep fuzzystrmatch to 1.0, having the functions levenshteing_* calling only the str_distance functions. This is not cool. Anyone who is running a 9.4 or earlier database using fuzzystrmatch and upgrades, either via dump-and-restore or pg_upgrade, to a version with this patch applied will have a broken database. They will still have the catalog entries for the 1.0 definitions, but those definitions won't be resolvable inside the new cluster's .so file. The user will get a fairly-unfriendly error message that won't go away until they upgrade the extension, which may involve dealing with dependency hell since the new definitions are in a different place than the old definitions, and there may be dependencies on the old definitions. One of the great advantages of extension packaging is that this kind of problem is quite easily avoidable, so let's avoid it. There are several possible methods of doing that, but I think the best one is just to leave the SQL-callable C functions in fuzzystrmatch and move only the underlying code that supports into core. Then, the whole thing will be completely transparent to users. They won't need to upgrade their fuzzystrmatch definitions at all, and everything will just work; under the covers, the fuzzystrmatch code will now be calling into core code rather than to code located in that same module, but the user doesn't need to know or care about that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
Robert Haas robertmh...@gmail.com writes: There are several possible methods of doing that, but I think the best one is just to leave the SQL-callable C functions in fuzzystrmatch and move only the underlying code that supports into core. I hadn't been paying close attention to this thread, but I'd just assumed that that would be the approach. It might be worth introducing new differently-named pg_proc entries for the same functions in core, but only if we can agree that there are better names for them than what the extension uses. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Wed, Jul 23, 2014 at 8:57 AM, Robert Haas robertmh...@gmail.com wrote: There are several possible methods of doing that, but I think the best one is just to leave the SQL-callable C functions in fuzzystrmatch and move only the underlying code that supports into core. For some reason I thought that that was what Michael was proposing - a more comprehensive move of code into core than the structuring that I proposed. I actually thought about a Levenshtein distance operator at one point months ago, before I entirely gave up on that. The MAX_LEVENSHTEIN_STRLEN limitation made me think that the Levenshtein distance functions are not suitable for core as is (although that doesn't matter for my purposes, since all I need is something that accommodates NAMEDATALEN sized strings). MAX_LEVENSHTEIN_STRLEN is a considerable limitation for an in-core feature. I didn't get around to forming an opinion on how and if that should be fixed. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
Peter Geoghegan wrote: For some reason I thought that that was what Michael was proposing - a more comprehensive move of code into core than the structuring that I proposed. I actually thought about a Levenshtein distance operator at one point months ago, before I entirely gave up on that. The MAX_LEVENSHTEIN_STRLEN limitation made me think that the Levenshtein distance functions are not suitable for core as is (although that doesn't matter for my purposes, since all I need is something that accommodates NAMEDATALEN sized strings). MAX_LEVENSHTEIN_STRLEN is a considerable limitation for an in-core feature. I didn't get around to forming an opinion on how and if that should be fixed. I had two thoughts: 1. Should we consider making levenshtein available to frontend programs as well as backend? 2. Would it provide better matching to use Damerau-Levenshtein[1] instead of raw Levenshtein? .oO(Would anyone be so bold as to attempt to implement bitap[2] using bitmapsets ...) [1] http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance [2] http://en.wikipedia.org/wiki/Bitap_algorithm -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Wed, Jul 23, 2014 at 1:10 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: I had two thoughts: 1. Should we consider making levenshtein available to frontend programs as well as backend? I don't think so. Why would that be useful? 2. Would it provide better matching to use Damerau-Levenshtein[1] instead of raw Levenshtein? Maybe that would be marginally better than classic Levenshtein distance, but I doubt it would pay for itself. It's just more code to maintain. Are we really expecting to not get the best possible suggestion due to some number of transposition errors very frequently? You still have to have a worse suggestion spuriously get ahead of yours, and typically there just aren't that many to begin with. I'm not targeting spelling errors so much as thinkos around plurals and whether or not an underscore was used. Damerau-Levenshtein seems like an algorithm with fairly specialized applications. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Thu, Jul 24, 2014 at 1:09 AM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: There are several possible methods of doing that, but I think the best one is just to leave the SQL-callable C functions in fuzzystrmatch and move only the underlying code that supports into core. I hadn't been paying close attention to this thread, but I'd just assumed that that would be the approach. It might be worth introducing new differently-named pg_proc entries for the same functions in core, but only if we can agree that there are better names for them than what the extension uses. Yes, that's a point I raised upthread as well. What about renaming those functions as string_distance and string_distance_less_than? Then have only fuzzystrmatch do some DirectFunctionCall using the in-core functions? -- Michael
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
Peter Geoghegan wrote: Maybe that would be marginally better than classic Levenshtein distance, but I doubt it would pay for itself. It's just more code to maintain. Are we really expecting to not get the best possible suggestion due to some number of transposition errors very frequently? You still have to have a worse suggestion spuriously get ahead of yours, and typically there just aren't that many to begin with. I'm not targeting spelling errors so much as thinkos around plurals and whether or not an underscore was used. Damerau-Levenshtein seems like an algorithm with fairly specialized applications. Yes, it's for typos. I guess it's an unfrequent scenario to have both a typoed column and a column that's missing the plural declension, which is the case in which Damerau-Lvsh would be a win. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
I am not opposed to moving the contrib code into core in the manner that you oppose. I don't feel strongly either way. I noticed in passing that your revision says this *within* levenshtein.c: + * Guaranteed to work with Name datatype's cstrings. + * For full details see levenshtein.c. On Thu, Jul 17, 2014 at 6:34 AM, Michael Paquier michael.paqu...@gmail.com wrote: Patch 2 is a rebase of the feature of Peter that can be applied on top of patch 1. The code is rather untouched (haven't much played with Peter's thingies), well-commented, but I think that this needs more work, particularly when a query has a single RTE like in this case where no hints are proposed to the user (mentioned upthread): The only source of disagreement that I am aware of at this point is the question of whether or not we should accept two candidates from the same RTE. I lean slightly towards no, as already explained [1] [2], but it's not as if I feel that strongly either way - this approach of looking for only a single best candidate per RTE taken in deference to the concerns of others. I imagined that when a committer picked this up, an executive decision would be made one way or the other. I am quite willing to revise the patch to alter this behavior at the request of a committer. [1] http://www.postgresql.org/message-id/CAM3SWZTrm4PmqMmL9=eyx-8f-vx-ha7dme4koms2vcomozg...@mail.gmail.com [2] http://www.postgresql.org/message-id/cam3swzs6kiqeqjz4pv3fkp6cgw1ws26exoqtjb_xmw3ze5b...@mail.gmail.com -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Wed, Jul 9, 2014 at 5:56 AM, Peter Geoghegan p...@heroku.com wrote: On Tue, Jul 8, 2014 at 1:55 PM, Peter Geoghegan p...@heroku.com wrote: I was worried about the common case where a column name is misspelled that would otherwise be ambiguous, which is why that shows a HINT while the single RTE case doesn't To be clear - I mean a HINT with two suggestions rather than just one. If there are 3 or more equally distant suggestions (even if they're all from different RTEs) we also give no HINT in the proposed patch. Showing up to 2 hints is fine as it does not pollute the error output with perhaps unnecessary messages. That's even more protective than for example git that prints all the equidistant candidates. However I can't understand why it does not show up hints even if there are two equidistant candidates from the same RTE. I think it should. -- Michael
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Wed, Jul 9, 2014 at 1:49 AM, Peter Geoghegan p...@heroku.com wrote: 4) This is not nice, could it be possible to remove the stuff from varlena.c? +/* Expand each Levenshtein distance variant */ +#include levenshtein.c +#define LEVENSHTEIN_LESS_EQUAL +#include levenshtein.c +#undef LEVENSHTEIN_LESS_EQUAL Part of the same comment: only varstr_leven_less_equal is used to calculate the distance, should we really move varstr_leven to core? This clearly needs to be reworked as not just a copy-paste of the things in fuzzystrmatch. The flag LEVENSHTEIN_LESS_EQUAL should be let within fuzzystrmatch I think. So there'd be one variant within core and one within contrib/fuzzystrmatch? I don't think that's an improvement. No. The main difference between varstr_leven_less_equal and varstr_leven is the use of the extra argument max_d in the former. My argument here is instead of blindly cut-pasting into core the code you are interested in to evaluate the string distances, is to refactor it to have a unique function, and to let the business with LEVENSHTEIN_LESS_EQUAL within contrib/fuzzystrmatch. This will require some reshuffling of the distance function, but by looking at this patch I am getting the feeling that this is necessary, and should even be split into a first patch for fuzzystrmatch that would facilitate its integration into core. Also why is rest_of_char_same within varlena.c? 5) Do we want hints on system columns as well? I think it's obvious that the answer must be no. That's going to frequently result in suggestions of columns that users will complain aren't even there. If you know about the system columns, you can just get it right. They're supposed to be hidden for most purposes. This may sound ridiculous, but I have already found myself mistyping ctid by tid and cid while working on patches and modules that played with page format, and needing a couple of minutes to understand what was going on (bad morning). I would have welcomed such hints in those cases. -- Michael
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Tue, Jul 8, 2014 at 11:10 PM, Michael Paquier michael.paqu...@gmail.com wrote: Showing up to 2 hints is fine as it does not pollute the error output with perhaps unnecessary messages. That's even more protective than for example git that prints all the equidistant candidates. However I can't understand why it does not show up hints even if there are two equidistant candidates from the same RTE. I think it should. Everyone is going to have an opinion on something like that. I was showing deference to the general concern about the absolute (as opposed to relative) quality of the HINTs in the event of equidistant matches by having no two suggestions come from within a single RTE, while still covering the case I thought was important by having two suggestions if there were two equidistant matches across RTEs. I think that's marginally better then what you propose, because your case deals with two equidistant though distinct columns, bringing into question the validity of both would-be suggestions. I'll defer to whatever the consensus is. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Tue, Jul 8, 2014 at 11:25 PM, Michael Paquier michael.paqu...@gmail.com wrote: So there'd be one variant within core and one within contrib/fuzzystrmatch? I don't think that's an improvement. No. The main difference between varstr_leven_less_equal and varstr_leven is the use of the extra argument max_d in the former. My argument here is instead of blindly cut-pasting into core the code you are interested in to evaluate the string distances, is to refactor it to have a unique function, and to let the business with LEVENSHTEIN_LESS_EQUAL within contrib/fuzzystrmatch. This will require some reshuffling of the distance function, but by looking at this patch I am getting the feeling that this is necessary, and should even be split into a first patch for fuzzystrmatch that would facilitate its integration into core. Also why is rest_of_char_same within varlena.c? Just as before, rest_of_char_same() exists for the express purpose of being called by the two variants varstr_leven_less_equal() and varstr_leven(). Why wouldn't I copy it over too along with those two? Where do you propose to put it? Obviously the existing macro hacks (that I haven't changed) that build the two variants are not terribly pretty, but they're not arbitrary either. They reflect the fact that there is no natural way to add callbacks or something like that. If you pretended that the core code didn't have to care about one case or the other, and that contrib was somehow obligated to hook in its own handler for the !LEVENSHTEIN_LESS_EQUAL case that it now only cares about, then you'd end up with an even bigger mess. Besides, with the patch the core code is calling varstr_leven_less_equal(), which is the bigger of the two variants - it's the LEVENSHTEIN_LESS_EQUAL case, not the !LEVENSHTEIN_LESS_EQUAL case that core cares about for the purposes of building HINTs. In short, I don't know what you mean. What would that reshuffling actually look like? 5) Do we want hints on system columns as well? I think it's obvious that the answer must be no. That's going to frequently result in suggestions of columns that users will complain aren't even there. If you know about the system columns, you can just get it right. They're supposed to be hidden for most purposes. This may sound ridiculous, but I have already found myself mistyping ctid by tid and cid while working on patches and modules that played with page format, and needing a couple of minutes to understand what was going on (bad morning). I think that it's clearly not worth it, even if it is true that a minority sometimes make this mistake. Most users don't know that there are system columns. It's not even close to being worth it to bring that into this. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
Peter Geoghegan wrote: On Tue, Jul 8, 2014 at 11:25 PM, Michael Paquier michael.paqu...@gmail.com wrote: 5) Do we want hints on system columns as well? I think it's obvious that the answer must be no. That's going to frequently result in suggestions of columns that users will complain aren't even there. If you know about the system columns, you can just get it right. They're supposed to be hidden for most purposes. This may sound ridiculous, but I have already found myself mistyping ctid by tid and cid while working on patches and modules that played with page format, and needing a couple of minutes to understand what was going on (bad morning). I think that it's clearly not worth it, even if it is true that a minority sometimes make this mistake. Most users don't know that there are system columns. It's not even close to being worth it to bring that into this. I agree with Peter. This is targeted at regular users. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Wed, Jul 9, 2014 at 2:10 AM, Michael Paquier michael.paqu...@gmail.com wrote: On Wed, Jul 9, 2014 at 5:56 AM, Peter Geoghegan p...@heroku.com wrote: On Tue, Jul 8, 2014 at 1:55 PM, Peter Geoghegan p...@heroku.com wrote: I was worried about the common case where a column name is misspelled that would otherwise be ambiguous, which is why that shows a HINT while the single RTE case doesn't To be clear - I mean a HINT with two suggestions rather than just one. If there are 3 or more equally distant suggestions (even if they're all from different RTEs) we also give no HINT in the proposed patch. Showing up to 2 hints is fine as it does not pollute the error output with perhaps unnecessary messages. That's even more protective than for example git that prints all the equidistant candidates. However I can't understand why it does not show up hints even if there are two equidistant candidates from the same RTE. I think it should. Me, too. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Wed, Jul 9, 2014 at 7:29 AM, Peter Geoghegan p...@heroku.com wrote: Everyone is going to have an opinion on something like that. I was showing deference to the general concern about the absolute (as opposed to relative) quality of the HINTs in the event of equidistant matches by having no two suggestions come from within a single RTE, while still covering the case I thought was important by having two suggestions if there were two equidistant matches across RTEs. I think that's marginally better then what you propose, because your case deals with two equidistant though distinct columns, bringing into question the validity of both would-be suggestions. I'll defer to whatever the consensus is. I agree this is bike shedding. But as long as we're bike shedding... A simple rule is easier for users to understand as well as to code. I would humbly suggest the following: take all the unqualified column names, downcase them, check which ones match most closely the unmatched column. Show the top 3 matches if they're within some arbitrary distance. If they match exactly except for the case and the unmatched column is all lower case add a comment that quoting is required due to the mixed case. Honestly the current logic and the previous logic both seemed reasonable to me. They're not going to be perfect in every case so anything that comes up some some suggestions is fine. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Wed, Jul 9, 2014 at 8:08 AM, Greg Stark st...@mit.edu wrote: A simple rule is easier for users to understand as well as to code. I would humbly suggest the following: take all the unqualified column names, downcase them, check which ones match most closely the unmatched column. Show the top 3 matches if they're within some arbitrary distance. That's harder than it sounds. You need even more translatable strings for variant ereports(). I don't think that an easy to understand rule is necessarily of much value - I'm already charging half price for deletion because I found representative errors more useful in certain cases by doing so. I think we want something that displays the most useful suggestion as often as is practically possible, and does not display unhelpful suggestions to the extent that it's practical to avoid them. Plus, as I mentioned, I'm keen to avoid adding more stuff to scanRTEForColumn() than I already have. Honestly the current logic and the previous logic both seemed reasonable to me. They're not going to be perfect in every case so anything that comes up some some suggestions is fine. I think that the most recent revision is somewhat better due to the feedback of Tom and Robert. I didn't feel as strongly as they did about erring on the side of not showing a HINT, but I think the most recent revision is a good compromise. But yes, at this point we're certainly chasing diminishing returns. There are almost any number of variants of this basic idea that could be suggested. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Wed, Jul 9, 2014 at 8:06 AM, Robert Haas robertmh...@gmail.com wrote: Showing up to 2 hints is fine as it does not pollute the error output with perhaps unnecessary messages. That's even more protective than for example git that prints all the equidistant candidates. However I can't understand why it does not show up hints even if there are two equidistant candidates from the same RTE. I think it should. Me, too. The idea is that each RTE gets one best suggestion, because if there are two best suggestions within an RTE they're probably both wrong. Whereas across RTEs, it's probably just that there is a foreign key relationship between the two (and the user accidentally failed to qualify the particular column of interest on top of the misspelling, a qualification that would be sufficient to have the code prefer the qualified-but-misspelled column). Clearly if I was to do what you suggest it would be closer to a wild guess, and Tom has expressed concerns about that. Now, I don't actually ensure that the column names of the two columns (each from separate RTEs) are identical save for their would-be alias, but that's just a consequence of the implementation. Also, as I've mentioned, I don't want to put more stuff in scanRTEForColumn() than I already have, due to your earlier concern about adding clutter. I think we're splitting hairs at this point, and frankly I'll do it that way if it gets the patch closer to being committed. While I thought it was important to get the unqualified and misspelled case right (which I did in the first revision, but perhaps at the expense of Tom's concern about absolute suggestion quality), I don't feel strongly about this detail either way. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
Peter Geoghegan wrote: On Wed, Jul 9, 2014 at 8:08 AM, Greg Stark st...@mit.edu wrote: A simple rule is easier for users to understand as well as to code. I would humbly suggest the following: take all the unqualified column names, downcase them, check which ones match most closely the unmatched column. Show the top 3 matches if they're within some arbitrary distance. That's harder than it sounds. You need even more translatable strings for variant ereports(). Maybe it is possible to rephrase the message so that the translatable part doesn't need to concern with how many suggestions there are. For instance something like perhaps you meant a name from the following list: foo, bar, baz. Couple with the errmsg_plural stuff, you then don't need to worry too much about providing different strings for 1, 2, N suggestions. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Wed, Jul 9, 2014 at 2:19 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: That's harder than it sounds. You need even more translatable strings for variant ereports(). Maybe it is possible to rephrase the message so that the translatable part doesn't need to concern with how many suggestions there are. For instance something like perhaps you meant a name from the following list: foo, bar, baz. Couple with the errmsg_plural stuff, you then don't need to worry too much about providing different strings for 1, 2, N suggestions. That's not really the problem. I already have a lot of things to test in each of the two ereport() calls. More importantly, showing the closet, say, 3 matches under an arbitrary distance does not weigh concerns about that indicating that they're all bad. It's not like bash tab completion - if there is one best match, that's probably because that's what the user meant. Whereas if there are two or more within a single RTE, that's probably because both are unhelpful. They both happened to require the same number of substitutions to get to, while not being quite bad enough matches to be excluded by the final check against a normalized distance threshold (the final check that prevents ludicrous suggestions). The fact that there were multiple equally plausible candidates (that are not identically named and just from different RTEs) tells us plenty, unlike with tab completion. It's not hard for one column to be a better match than another, and so it doesn't seem unreasonable to insist upon that within a single RTE where they cannot be identical, since a conservative approach seems to be what is generally favored. In any case I'm just trying to weigh everyone's concerns here. I hope it's actually possible to compromise, but right now I don't know what I can do to make useful progress. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Tue, Jul 8, 2014 at 6:58 AM, Michael Paquier michael.paqu...@gmail.com wrote: 6) Sometimes no hints are returned... Even in simple cases like this one: =# create table foo (aa int, bb int); CREATE TABLE =# select ab from foo; ERROR: 42703: column ab does not exist LINE 1: select ab from foo; ^ LOCATION: errorMissingColumn, parse_relation.c:3123 In this example, it seems obvious that both aa and bb should be suggested when they are not. But what if there were far more columns, as might be expected in realistic cases (suppose all other columns have at least 3 characters)? That's another kettle of fish. The assumption that it's probably one of those two equally distant columns is now on very shaky ground. After all, the user can only have meant one particular column. If we apply a limited kind of Turing test to this second case, how does the most recent revision's algorithm do? What would a human suggest? I'm pretty sure the answer is that the human would shrug. Maybe he or she would say I guess you might have meant one of either aa or bb, but that really isn't obvious at all. That doesn't inspire much confidence. Now, maybe I should be more optimistic about it being one of the two because there are only two possibilities to begin with. That seems pretty dubious, though. In general I find it much more plausible based on what we know that the user should rethink everything. And, as Tom pointed out, showing nothing conveys something in itself once users have been trained to expect something. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Sat, Jul 5, 2014 at 12:46 AM, Abhijit Menon-Sen a...@2ndquadrant.com wrote: At 2014-07-02 15:51:08 -0700, p...@heroku.com wrote: Attached revision factors in everyone's concerns here, I think. Is anyone planning to review Peter's revised patch? I have been doing some functional tests, and looked quickly at the code to understand what it does: 1) Compiles without warnings, passes regression tests 2) Checking process goes through all the existing columns of a relation even a difference of 1 with some other column(s) has already been found. As we try to limit the number of hints returned, this seems like a waste of resources. 3) distanceName could be improved, by for example having some checks on the string lengths of target and source columns, and immediately reject the match if for example the length of the source string is the double/half of the length of target. 4) This is not nice, could it be possible to remove the stuff from varlena.c? +/* Expand each Levenshtein distance variant */ +#include levenshtein.c +#define LEVENSHTEIN_LESS_EQUAL +#include levenshtein.c +#undef LEVENSHTEIN_LESS_EQUAL Part of the same comment: only varstr_leven_less_equal is used to calculate the distance, should we really move varstr_leven to core? This clearly needs to be reworked as not just a copy-paste of the things in fuzzystrmatch. The flag LEVENSHTEIN_LESS_EQUAL should be let within fuzzystrmatch I think. 5) Do we want hints on system columns as well? For example here we could get tableoid as column hint: =# select tablepid from foo; ERROR: 42703: column tablepid does not exist LINE 1: select tablepid from foo; ^ LOCATION: errorMissingColumn, parse_relation.c:3123 Time: 0.425 ms 6) Sometimes no hints are returned... Even in simple cases like this one: =# create table foo (aa int, bb int); CREATE TABLE =# select ab from foo; ERROR: 42703: column ab does not exist LINE 1: select ab from foo; ^ LOCATION: errorMissingColumn, parse_relation.c:3123 7) Performance penalty with a table with 1600 columns: =# CREATE FUNCTION create_long_table(tabname text, columns int) RETURNS void LANGUAGE plpgsql as $$ declare first_col bool = true; count int; query text; begin query := 'CREATE TABLE ' || tabname || ' ('; for count in 0..columns loop query := query || 'col' || count || ' int'; if count columns then query := query || ', '; end if; end loop; query := query || ')'; execute query; end; $$; =# SELECT create_long_table('aa', 1599); create_long_table --- (1 row) Then tested queries like that: SELECT col888a FROM aa; Patched version: 2.100ms~2.200ms master branch (6048896): 0.956 ms~0.990 ms So the performance impact seems limited. Regards, -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
Hi, On Tue, Jul 8, 2014 at 6:58 AM, Michael Paquier michael.paqu...@gmail.com wrote: 2) Checking process goes through all the existing columns of a relation even a difference of 1 with some other column(s) has already been found. As we try to limit the number of hints returned, this seems like a waste of resources. In general it's possible that an exact match will later be found within the RTE, and exact matches don't have to pay the wrong alias penalty, and are immediately returned. It is therefore not a waste of resources, but even if it was that would be pretty inconsequential as your benchmark shows. 3) distanceName could be improved, by for example having some checks on the string lengths of target and source columns, and immediately reject the match if for example the length of the source string is the double/half of the length of target. I don't think it's a good idea to tie distanceName() to the ultimate behavior of errorMissingColumn() hinting, since there may be other callers in the future. Besides, that isn't going to help much. 4) This is not nice, could it be possible to remove the stuff from varlena.c? +/* Expand each Levenshtein distance variant */ +#include levenshtein.c +#define LEVENSHTEIN_LESS_EQUAL +#include levenshtein.c +#undef LEVENSHTEIN_LESS_EQUAL Part of the same comment: only varstr_leven_less_equal is used to calculate the distance, should we really move varstr_leven to core? This clearly needs to be reworked as not just a copy-paste of the things in fuzzystrmatch. The flag LEVENSHTEIN_LESS_EQUAL should be let within fuzzystrmatch I think. So there'd be one variant within core and one within contrib/fuzzystrmatch? I don't think that's an improvement. 5) Do we want hints on system columns as well? I think it's obvious that the answer must be no. That's going to frequently result in suggestions of columns that users will complain aren't even there. If you know about the system columns, you can just get it right. They're supposed to be hidden for most purposes. 6) Sometimes no hints are returned... Even in simple cases like this one: =# create table foo (aa int, bb int); CREATE TABLE =# select ab from foo; ERROR: 42703: column ab does not exist LINE 1: select ab from foo; ^ LOCATION: errorMissingColumn, parse_relation.c:3123 That's because those two candidates come from a single RTE and have an equal distance -- you'd see both suggestions if you joined two tables with each candidate, assuming that each table being joined didn't individually have the same issue. I think that that's probably considered the correct behavior by most. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
Peter Geoghegan wrote: 6) Sometimes no hints are returned... Even in simple cases like this one: =# create table foo (aa int, bb int); CREATE TABLE =# select ab from foo; ERROR: 42703: column ab does not exist LINE 1: select ab from foo; ^ LOCATION: errorMissingColumn, parse_relation.c:3123 That's because those two candidates come from a single RTE and have an equal distance -- you'd see both suggestions if you joined two tables with each candidate, assuming that each table being joined didn't individually have the same issue. I think that that's probably considered the correct behavior by most. It seems pretty silly to me actually. Was this designed by a committee? I agree with the general principle that showing a large number of candidates (a la bash) is a bad idea, but failing to show two of them ... Words fail me. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Tue, Jul 8, 2014 at 1:42 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: That's because those two candidates come from a single RTE and have an equal distance -- you'd see both suggestions if you joined two tables with each candidate, assuming that each table being joined didn't individually have the same issue. I think that that's probably considered the correct behavior by most. It seems pretty silly to me actually. Was this designed by a committee? I agree with the general principle that showing a large number of candidates (a la bash) is a bad idea, but failing to show two of them ... I guess it was designed by a committee. But we don't fail to show both because they're equally distant. Rather, it's because they're equally distant and from the same RTE. This is a contrived example, but typically showing equally distant columns is useful when they're in a foreign-key relationship - I was worried about the common case where a column name is misspelled that would otherwise be ambiguous, which is why that shows a HINT while the single RTE case doesn't. I think that in most realistic cases it wouldn't be all that useful to show two columns from the same table when they're equally distant. It's easy to imagine that reflecting that no match is good in absolute terms, and we're somewhat conservative about showing any match. While I think this general behavior is defensible, I must admit that it did suit me to write it that way because to do otherwise would have necessitated more invasive code in the existing general purpose scanRTEForColumn() function. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Tue, Jul 8, 2014 at 1:55 PM, Peter Geoghegan p...@heroku.com wrote: I was worried about the common case where a column name is misspelled that would otherwise be ambiguous, which is why that shows a HINT while the single RTE case doesn't To be clear - I mean a HINT with two suggestions rather than just one. If there are 3 or more equally distant suggestions (even if they're all from different RTEs) we also give no HINT in the proposed patch. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
At 2014-07-02 15:51:08 -0700, p...@heroku.com wrote: Attached revision factors in everyone's concerns here, I think. Is anyone planning to review Peter's revised patch? These new measures make the coding somewhat more complex than that of the initial version, although overall the parser code added by this patch is almost entirely confined to code paths concerned only with producing diagnostic messages to help users. Yes, the new patch looks quite a bit more involved than earlier, but if that's what it takes to provide a useful HINT, I guess it's not too bad. -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Sun, Jun 29, 2014 at 7:30 AM, Tom Lane t...@sss.pgh.pa.us wrote: Although printing all candidates seems to be what's preferred by existing systems with similar facilities, I can see the point that constructing the message in a translatable fashion might be difficult. So personally I'd be willing to abandon insistence on that. I still think though that printing candidates with very large distances would be unhelpful. Attached revision factors in everyone's concerns here, I think. I've addressed your concern about the closeness of the match proposed in the HINT - the absolute as opposed to relative quality of the match. There is a normalized distance threshold that must always be exceeded to prevent ludicrous suggestions. This works along similar lines to those sketched by Robert. Furthermore, I've made it occasionally possible to see 2 suggestions, when they're equally distant and when each suggestion comes from a different range table entry. However, if the two best suggestions (overall or within an RTE) come from within the same RTE, then that RTE is ignored for the purposes of picking a suggestion (although the lowest observed distance from an ignored RTE may still be used as the distance for later RTEs to beat to get their attributes suggested in the HINT). The idea here is that this quality-bar for suggestions doesn't come at the cost of ignoring my concern about the presumably somewhat common case where there is an unqualified and therefore ambiguous column reference that happens to also be misspelled. An ambiguous column reference and an incorrectly spelled column name are both very common, and so it seems likely that momentary lapses where the user gets both things wrong at once are also common. We do all this without going overboard, since as outlined by Robert, when there are 3 or more equally distant candidates (even if they all come from different RTEs), we give no HINT at all. The big picture here is to make mental context switches cheap when writing ad-hoc queries in psql. A lot of the HINTs that popped up in the regression tests that seemed kind of questionable no longer appear. These new measures make the coding somewhat more complex than that of the initial version, although overall the parser code added by this patch is almost entirely confined to code paths concerned only with producing diagnostic messages to help users. -- Peter Geoghegan levenshtein_column_hint.v2.2014_07_02.patch.gz Description: GNU Zip compressed data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
So, what's the status of this patch? There's been quite a lot of discussion (though only about the approach; no formal code/usage review has yet been posted), but as far as I can tell, it just tapered off without any particular consensus. -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
Abhijit Menon-Sen a...@2ndquadrant.com writes: So, what's the status of this patch? There's been quite a lot of discussion (though only about the approach; no formal code/usage review has yet been posted), but as far as I can tell, it just tapered off without any particular consensus. AFAICT, people generally agree that this would probably be useful, but there's not consensus on how far the code should be willing to reach for a match, nor on what to do when there are multiple roughly-equally-plausible candidates. Although printing all candidates seems to be what's preferred by existing systems with similar facilities, I can see the point that constructing the message in a translatable fashion might be difficult. So personally I'd be willing to abandon insistence on that. I still think though that printing candidates with very large distances would be unhelpful. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers