Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2015-03-12 Thread Robert Haas
On Tue, Mar 10, 2015 at 4:03 PM, Peter Geoghegan p...@heroku.com wrote:
 On Tue, Mar 10, 2015 at 11:28 AM, Robert Haas robertmh...@gmail.com wrote:
 I'm prepared to commit this version if nobody finds a problem with it.
 It passes the additional regression tests you wrote.

 Looks good to me. Thanks.

OK, committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2015-03-10 Thread Robert Haas
On Mon, Dec 22, 2014 at 7:34 PM, Peter Geoghegan p...@heroku.com wrote:
 On Mon, Dec 22, 2014 at 5:50 AM, Robert Haas robertmh...@gmail.com wrote:
 Looking over the latest patch, I think we could simplify the code so
 that you don't need multiple FuzzyAttrMatchState objects.  Instead of
 creating a separate one for each RTE and then merging them, just have
 one.  When you find an inexact-RTE name match, set a field inside the
 FuzzyAttrMatchState -- maybe with a name like rte_penalty -- to the
 Levenshtein distance between the RTEs.  Then call scanRTEForColumn()
 and pass down the same state object.  Now let
 updateFuzzyAttrMatchState() work out what it needs to do based on the
 observed inter-column distance and the currently-in-force RTE penalty.

 I'm afraid I don't follow. I think doing things that way makes things
 less clear. Merging is useful because it allows us to consider that an
 exact match might exist, which this searchRangeTableForCol() is
 already tasked with today. We now look for the best match
 exhaustively, or magically return immediately in the event of an exact
 match, without caring about the alias correctness or distance.

 Having a separate object makes this pattern apparent from the top
 level, within searchRangeTableForCol(). I feel that's better.
 updateFuzzyAttrMatchState() is the wrong place to put that, because
 that task rightfully belongs in searchRangeTableForCol(), where the
 high level diagnostic-report-generating control flow lives.

 To put it another way, creating a separate object obfuscates
 scanRTEForColumn(), since it's the only client of
 updateFuzzyAttrMatchState(). scanRTEForColumn() is a very important
 function, and right now I am only making it slightly less clear by
 tasking it with caring about distance of names on top of strict binary
 equality of attribute names. I don't want to push it any further.

I don't buy this.  What you're essentially doing is using the
FuzzyAttrMatchState object in two ways that are not entirely
compatible with each other - updateFuzzyAttrMatchState doesn't set the
RTE fields, so searchRangeTableForCol has to do it.  So there's an
unspoken contract that in some parts of the code, you can rely on
those fields being set, and in others, you can't.  That pretty much
defeats the whole point of making the state its own object, AFAICS.
Furthermore, you end up with two copies of the state-combining logic,
one in FuzzyAttrMatchState and a second in searchRangeTableForCol.
That's ugly and unnecessary.

I decided to rework this patch myself today; my version is attached.
I believe that this version is significantly easier to understand than
yours, both as to the code and the comments.  I put quite a bit of
work into both.  I also suspect it's more efficient, because it avoids
computing the Levenshtein distances for column names when we already
know that those column names can't possibly be sufficiently-good
matches for us to care about the details; and when it does compute the
Levenshtein distance it keeps the max-distance threshold as low as
possible.  That may not really matter much, but it can't hurt.  More
importantly, essentially all of the fuzzy-matching logic is now
isolated in FuzzyAttrMatchState(); the volume of change in
scanRTEForColumn is the same as in your version, but the volume of
change in searchRangeTableForCol is quite a bit less, so the patch is
smaller overall.

I'm prepared to commit this version if nobody finds a problem with it.
It passes the additional regression tests you wrote.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


column-hint-rmh.patch
Description: binary/octet-stream

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2015-03-10 Thread Peter Geoghegan
On Tue, Mar 10, 2015 at 11:28 AM, Robert Haas robertmh...@gmail.com wrote:
 I'm prepared to commit this version if nobody finds a problem with it.
 It passes the additional regression tests you wrote.

Looks good to me. Thanks.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2015-02-12 Thread Michael Paquier
On Tue, Dec 23, 2014 at 9:43 AM, Peter Geoghegan p...@heroku.com wrote:

 On Mon, Dec 22, 2014 at 4:34 PM, Peter Geoghegan p...@heroku.com wrote:
  To put it another way, creating a separate object obfuscates
  scanRTEForColumn(), since it's the only client of
  updateFuzzyAttrMatchState().


 Excuse me. I mean *not* creating a separate object -- having a unified
 state representation for the entire range-table, rather than having
 one per RTE and merging them one by one into an overall/final range
 table object.


Patch moved to CF 2015-02.
-- 
Michael


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-12-22 Thread Robert Haas
On Sat, Dec 20, 2014 at 7:30 PM, Peter Geoghegan p...@heroku.com wrote:
 On Sat, Dec 20, 2014 at 3:17 PM, Peter Geoghegan p...@heroku.com wrote:
 Attached patch implements this scheme.

 I had another thought: NAMEDATALEN + 1 is a better representation of
 infinity for matching purposes than INT_MAX. I probably should have
 made that change, too. It would then not have been necessary to
 #include limits.h. I think that this is a useful
 belt-and-suspenders precaution against integer overflow. It almost
 certainly won't matter, since it's very unlikely that the best match
 within an RTE will end up being a dropped column, but we might as well
 do it that way (Levenshtein distance is costed in multiples of code
 point changes, but the maximum density is 1 byte per codepoint).

Good idea.

Looking over the latest patch, I think we could simplify the code so
that you don't need multiple FuzzyAttrMatchState objects.  Instead of
creating a separate one for each RTE and then merging them, just have
one.  When you find an inexact-RTE name match, set a field inside the
FuzzyAttrMatchState -- maybe with a name like rte_penalty -- to the
Levenshtein distance between the RTEs.  Then call scanRTEForColumn()
and pass down the same state object.  Now let
updateFuzzyAttrMatchState() work out what it needs to do based on the
observed inter-column distance and the currently-in-force RTE penalty.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-12-22 Thread Peter Geoghegan
On Mon, Dec 22, 2014 at 5:50 AM, Robert Haas robertmh...@gmail.com wrote:
 Looking over the latest patch, I think we could simplify the code so
 that you don't need multiple FuzzyAttrMatchState objects.  Instead of
 creating a separate one for each RTE and then merging them, just have
 one.  When you find an inexact-RTE name match, set a field inside the
 FuzzyAttrMatchState -- maybe with a name like rte_penalty -- to the
 Levenshtein distance between the RTEs.  Then call scanRTEForColumn()
 and pass down the same state object.  Now let
 updateFuzzyAttrMatchState() work out what it needs to do based on the
 observed inter-column distance and the currently-in-force RTE penalty.

I'm afraid I don't follow. I think doing things that way makes things
less clear. Merging is useful because it allows us to consider that an
exact match might exist, which this searchRangeTableForCol() is
already tasked with today. We now look for the best match
exhaustively, or magically return immediately in the event of an exact
match, without caring about the alias correctness or distance.

Having a separate object makes this pattern apparent from the top
level, within searchRangeTableForCol(). I feel that's better.
updateFuzzyAttrMatchState() is the wrong place to put that, because
that task rightfully belongs in searchRangeTableForCol(), where the
high level diagnostic-report-generating control flow lives.

To put it another way, creating a separate object obfuscates
scanRTEForColumn(), since it's the only client of
updateFuzzyAttrMatchState(). scanRTEForColumn() is a very important
function, and right now I am only making it slightly less clear by
tasking it with caring about distance of names on top of strict binary
equality of attribute names. I don't want to push it any further.
-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-12-22 Thread Peter Geoghegan
On Mon, Dec 22, 2014 at 4:34 PM, Peter Geoghegan p...@heroku.com wrote:
 To put it another way, creating a separate object obfuscates
 scanRTEForColumn(), since it's the only client of
 updateFuzzyAttrMatchState().


Excuse me. I mean *not* creating a separate object -- having a unified
state representation for the entire range-table, rather than having
one per RTE and merging them one by one into an overall/final range
table object.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-12-20 Thread Peter Geoghegan
On Tue, Dec 16, 2014 at 12:18 PM, Stephen Frost sfr...@snowman.net wrote:
 * Tom Lane (t...@sss.pgh.pa.us) wrote:
 So my two cents is that when considering a qualified name, this patch
 should take levenshtein distance across the two components equally.
 There's no good reason to suppose that typos will attack one name
 component more (nor less) than the other.

 Agreed (since it seems like folks are curious for the opinion's of
 mostly bystanders).

 +1 to the above for my part.

Okay, then. Attached patch implements this scheme. It is identical to
the previous revision, except that iff there was an alias specified
and that alias does not match the correct name (alias/table name) of
the RTE currently under consideration, we charge the distance between
the differing aliases rather than a fixed distance of 1.

-- 
Peter Geoghegan
From 91087191ba49e5fed7fdfa43f98deb009c2b3e0e Mon Sep 17 00:00:00 2001
From: Peter Geoghegan p...@heroku.com
Date: Wed, 12 Nov 2014 15:31:37 -0800
Subject: [PATCH] Levenshtein distance column HINT

Add a new HINT -- a guess as to what column the user might have intended
to reference, to be shown in various contexts where an
ERRCODE_UNDEFINED_COLUMN error is raised.  The user will see this HINT
when he or she fat-fingers a column reference in an ad-hoc SQL query, or
incorrectly pluralizes or fails to pluralize a column reference, or
incorrectly omits or includes an underscore or other punctuation
character.

The HINT suggests a column in the range table with the lowest
Levenshtein distance, or the tied-for-best pair of matching columns in
the event of there being exactly two equally likely candidates (these
may come from multiple RTEs, or the same RTE).  Limiting to two the
number of cases where multiple equally likely suggestions are all
offered at once (i.e.  giving no hint when the number of equally likely
candidates exceeds two) is a measure against suggestions that are of low
quality in an absolute sense.

A further, final measure is taken against suggestions that are of low
absolute quality:  If the distance exceeds a normalized distance
threshold, no suggestion is given.
---
 src/backend/parser/parse_expr.c   |   9 +-
 src/backend/parser/parse_func.c   |   2 +-
 src/backend/parser/parse_relation.c   | 325 +++---
 src/backend/utils/adt/levenshtein.c   |   9 +
 src/include/parser/parse_relation.h   |  20 +-
 src/test/regress/expected/alter_table.out |   3 +
 src/test/regress/expected/join.out|  38 
 src/test/regress/sql/join.sql |  24 +++
 8 files changed, 392 insertions(+), 38 deletions(-)

diff --git a/src/backend/parser/parse_expr.c b/src/backend/parser/parse_expr.c
index 4a8aaf6..a77a3a0 100644
--- a/src/backend/parser/parse_expr.c
+++ b/src/backend/parser/parse_expr.c
@@ -621,7 +621,8 @@ transformColumnRef(ParseState *pstate, ColumnRef *cref)
 colname = strVal(field2);
 
 /* Try to identify as a column of the RTE */
-node = scanRTEForColumn(pstate, rte, colname, cref-location);
+node = scanRTEForColumn(pstate, rte, colname, cref-location,
+		NULL);
 if (node == NULL)
 {
 	/* Try it as a function call on the whole row */
@@ -666,7 +667,8 @@ transformColumnRef(ParseState *pstate, ColumnRef *cref)
 colname = strVal(field3);
 
 /* Try to identify as a column of the RTE */
-node = scanRTEForColumn(pstate, rte, colname, cref-location);
+node = scanRTEForColumn(pstate, rte, colname, cref-location,
+		NULL);
 if (node == NULL)
 {
 	/* Try it as a function call on the whole row */
@@ -724,7 +726,8 @@ transformColumnRef(ParseState *pstate, ColumnRef *cref)
 colname = strVal(field4);
 
 /* Try to identify as a column of the RTE */
-node = scanRTEForColumn(pstate, rte, colname, cref-location);
+node = scanRTEForColumn(pstate, rte, colname, cref-location,
+		NULL);
 if (node == NULL)
 {
 	/* Try it as a function call on the whole row */
diff --git a/src/backend/parser/parse_func.c b/src/backend/parser/parse_func.c
index 9ebd3fd..472e15e 100644
--- a/src/backend/parser/parse_func.c
+++ b/src/backend/parser/parse_func.c
@@ -1779,7 +1779,7 @@ ParseComplexProjection(ParseState *pstate, char *funcname, Node *first_arg,
 	 ((Var *) first_arg)-varno,
 	 ((Var *) first_arg)-varlevelsup);
 		/* Return a Var if funcname matches a column, else NULL */
-		return scanRTEForColumn(pstate, rte, funcname, location);
+		return scanRTEForColumn(pstate, rte, funcname, location, NULL);
 	}
 
 	/*
diff --git a/src/backend/parser/parse_relation.c b/src/backend/parser/parse_relation.c
index 478584d..e6adee1 100644
--- a/src/backend/parser/parse_relation.c
+++ b/src/backend/parser/parse_relation.c
@@ -15,6 +15,7 @@
 #include postgres.h
 
 #include ctype.h
+#include limits.h
 
 #include access/htup_details.h
 #include access/sysattr.h
@@ -520,6 +521,69 @@ GetCTEForRTE(ParseState *pstate, RangeTblEntry *rte, int 

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-12-20 Thread Peter Geoghegan
On Sat, Dec 20, 2014 at 3:17 PM, Peter Geoghegan p...@heroku.com wrote:
 Attached patch implements this scheme.

I had another thought: NAMEDATALEN + 1 is a better representation of
infinity for matching purposes than INT_MAX. I probably should have
made that change, too. It would then not have been necessary to
#include limits.h. I think that this is a useful
belt-and-suspenders precaution against integer overflow. It almost
certainly won't matter, since it's very unlikely that the best match
within an RTE will end up being a dropped column, but we might as well
do it that way (Levenshtein distance is costed in multiples of code
point changes, but the maximum density is 1 byte per codepoint).

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-12-16 Thread Stephen Frost
* Tom Lane (t...@sss.pgh.pa.us) wrote:
 So my two cents is that when considering a qualified name, this patch
 should take levenshtein distance across the two components equally.
 There's no good reason to suppose that typos will attack one name
 component more (nor less) than the other.

Agreed (since it seems like folks are curious for the opinion's of
mostly bystanders).

+1 to the above for my part.

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-12-15 Thread Robert Haas
On Sun, Dec 14, 2014 at 8:24 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 On Tue, Dec 9, 2014 at 2:52 AM, Peter Geoghegan p...@heroku.com wrote:
 On Mon, Dec 8, 2014 at 9:43 AM, Peter Geoghegan p...@heroku.com wrote:
 I think it's very possible that the wrong alias may be provided by the
 user, and that we should consider that when providing a hint.

 Note that the existing mechanism (the mechanism that I'm trying to
 improve) only ever shows this error message:

 There is a column named \%s\ in table \%s\, but it cannot be
 referenced from this part of the query.

 I think it's pretty clear that this general class of user error is common.
 Moving this patch to CF 2014-12 as work is still going on, note that
 it is currently marked with Robert as reviewer and that its current
 status is Needs review.

The status here is more like waiting around to see if anyone else has
an opinion.  The issue is what should happen when you enter qualified
name like alvaro.herrera and there is no column named anything like
herrara in the RTE named alvaro, but there is some OTHER RTE that
contains a column with a name that is only a small Levenshtein
distance away from herrera, like roberto.correra.  The questions are:

1. Should we EVER give a you-might-have-meant hint in a case like this?
2. If so, does it matter whether the RTE name is just a bit different
from the actual RTE or whether it's completely different?  In other
words, might we skip the hint in the above case but give one for
alvara.correra?

My current feeling is that we should answer #1 no, but Peter prefers
to answer it yes.  My further feeling is that if we do decide to say
yes to #1, then I would answer #2 as yes also, but Peter would
answer it no, assigning a fixed penalty for a mismatched RTE rather
than one that varies by the Levenshtein distance between the RTEs.

If no one else expresses an opinion, I'm going to insist on doing it
my way, but I'm happy to have other people weigh in.

Thanks,

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-12-15 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Sun, Dec 14, 2014 at 8:24 PM, Michael Paquier
 michael.paqu...@gmail.com wrote:
 Moving this patch to CF 2014-12 as work is still going on, note that
 it is currently marked with Robert as reviewer and that its current
 status is Needs review.

 The status here is more like waiting around to see if anyone else has
 an opinion.  The issue is what should happen when you enter qualified
 name like alvaro.herrera and there is no column named anything like
 herrara in the RTE named alvaro, but there is some OTHER RTE that
 contains a column with a name that is only a small Levenshtein
 distance away from herrera, like roberto.correra.  The questions are:

 1. Should we EVER give a you-might-have-meant hint in a case like this?
 2. If so, does it matter whether the RTE name is just a bit different
 from the actual RTE or whether it's completely different?  In other
 words, might we skip the hint in the above case but give one for
 alvara.correra?

It would be astonishingly silly to not care about the RTE name's distance,
if you ask me.  This is supposed to detect typos, not thinkos.

I think there might be some value in a separate heuristic that, when
you typed foo.bar and that doesn't match but there is a baz.bar, suggests
that maybe you meant baz.bar, even if baz is not close typo-wise.  This
would be addressing the thinko case not the typo case, so the rules ought
to be quite different --- in particular I doubt that it'd be a good idea
to hint this way if the column names don't match exactly.  But in any
case the key point is that this is a different heuristic addressing a
different failure mode.  We should not try to make the
levenshtein-distance heuristic address that case.

So my two cents is that when considering a qualified name, this patch
should take levenshtein distance across the two components equally.
There's no good reason to suppose that typos will attack one name
component more (nor less) than the other.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-12-14 Thread Michael Paquier
On Tue, Dec 9, 2014 at 2:52 AM, Peter Geoghegan p...@heroku.com wrote:
 On Mon, Dec 8, 2014 at 9:43 AM, Peter Geoghegan p...@heroku.com wrote:
 I think it's very possible that the wrong alias may be provided by the
 user, and that we should consider that when providing a hint.

 Note that the existing mechanism (the mechanism that I'm trying to
 improve) only ever shows this error message:

 There is a column named \%s\ in table \%s\, but it cannot be
 referenced from this part of the query.

 I think it's pretty clear that this general class of user error is common.
Moving this patch to CF 2014-12 as work is still going on, note that
it is currently marked with Robert as reviewer and that its current
status is Needs review.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-12-08 Thread Robert Haas
On Fri, Dec 5, 2014 at 3:45 PM, Peter Geoghegan p...@heroku.com wrote:
 On Fri, Dec 5, 2014 at 12:33 PM, Robert Haas robertmh...@gmail.com wrote:
 Well, if an alias is used, and you refer to an attribute using a
 non-alias name (i.e. the original table name), then you'll already get
 an error suggesting that the alias be used instead -- of course,
 that's nothing new. It doesn't matter to the existing hinting
 mechanism if the attribute name is otherwise wrong. Once you fix the
 code to use the alias suggested, you'll then get this new
 Levenshtein-based hint.

 In that case, I think I favor giving no hint at all when the RTE name
 is specified but doesn't match exactly.

 I don't follow. The existing mechanism only concerns what to do when
 the original table name was used when an alias should have been used
 instead. What does that have to do with this patch?

Just that that's the case in which it seems useful to give a hint.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-12-08 Thread Peter Geoghegan
On Mon, Dec 8, 2014 at 9:31 AM, Robert Haas robertmh...@gmail.com wrote:
 Just that that's the case in which it seems useful to give a hint.

I think it's very possible that the wrong alias may be provided by the
user, and that we should consider that when providing a hint. Besides,
considering every visible RTE (while penalizing non-exact alias names
iff the user provided an alias name) is actually going to make bad
hints less likely, by increasing the number of equidistant low quality
matches in a way that swamps the mechanism into providing no actual
match at all. That's an important additional protection against low
quality matches.

What do other people think here?
-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-12-08 Thread Peter Geoghegan
On Mon, Dec 8, 2014 at 9:43 AM, Peter Geoghegan p...@heroku.com wrote:
 I think it's very possible that the wrong alias may be provided by the
 user, and that we should consider that when providing a hint.

Note that the existing mechanism (the mechanism that I'm trying to
improve) only ever shows this error message:

There is a column named \%s\ in table \%s\, but it cannot be
referenced from this part of the query.

I think it's pretty clear that this general class of user error is common.
-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-12-05 Thread Robert Haas
On Wed, Dec 3, 2014 at 9:21 PM, Peter Geoghegan p...@heroku.com wrote:
 On Tue, Dec 2, 2014 at 1:11 PM, Robert Haas robertmh...@gmail.com wrote:
 Basically, the case in which I think it's helpful to issue a
 suggestion here is when the user has used the table name rather than
 the alias name.  I wonder if it's worth checking for that case
 specifically, in lieu of what you've done here, and issuing a totally
 different hint in that case (HINT: You must refer to this as column
 as prime_minister.id rather than cameron.id).

 Well, if an alias is used, and you refer to an attribute using a
 non-alias name (i.e. the original table name), then you'll already get
 an error suggesting that the alias be used instead -- of course,
 that's nothing new. It doesn't matter to the existing hinting
 mechanism if the attribute name is otherwise wrong. Once you fix the
 code to use the alias suggested, you'll then get this new
 Levenshtein-based hint.

In that case, I think I favor giving no hint at all when the RTE name
is specified but doesn't match exactly.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-12-05 Thread Peter Geoghegan
On Fri, Dec 5, 2014 at 12:33 PM, Robert Haas robertmh...@gmail.com wrote:
 Well, if an alias is used, and you refer to an attribute using a
 non-alias name (i.e. the original table name), then you'll already get
 an error suggesting that the alias be used instead -- of course,
 that's nothing new. It doesn't matter to the existing hinting
 mechanism if the attribute name is otherwise wrong. Once you fix the
 code to use the alias suggested, you'll then get this new
 Levenshtein-based hint.

 In that case, I think I favor giving no hint at all when the RTE name
 is specified but doesn't match exactly.

I don't follow. The existing mechanism only concerns what to do when
the original table name was used when an alias should have been used
instead. What does that have to do with this patch?

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-12-03 Thread Peter Geoghegan
On Tue, Dec 2, 2014 at 1:11 PM, Robert Haas robertmh...@gmail.com wrote:
 Basically, the case in which I think it's helpful to issue a
 suggestion here is when the user has used the table name rather than
 the alias name.  I wonder if it's worth checking for that case
 specifically, in lieu of what you've done here, and issuing a totally
 different hint in that case (HINT: You must refer to this as column
 as prime_minister.id rather than cameron.id).

Well, if an alias is used, and you refer to an attribute using a
non-alias name (i.e. the original table name), then you'll already get
an error suggesting that the alias be used instead -- of course,
that's nothing new. It doesn't matter to the existing hinting
mechanism if the attribute name is otherwise wrong. Once you fix the
code to use the alias suggested, you'll then get this new
Levenshtein-based hint.

 Another idea, which I think I like less well, is to check the
 Levenshtein distance between the allowed alias and the entered alias
 and, if that's within the half-the-shorter-length threshold, consider
 possible matches from that RTE, charge the distance between the
 correct alias and the entered alias as a penalty to each potential
 column match.

I don't about that either. Aliases are often totally arbitrary,
particularly for ad-hoc queries, which is what this is aimed at.

 What I think won't do is to look at a situation where the user has
 entered automobile.id and suggest that maybe they meant student.iq, or
 even student.id.

I'm not sure I follow. If there is an automobile.ip, then it will be
suggested. If there is no automobile column that's much of a match (so
no automobile.ip, say), then student.id will be suggested (and not
student.iq, *even if there is no student.id* - the final quality check
saves us). So this is possible:

postgres=# select iq, * from student, automobile;
ERROR:  42703: column iq does not exist
LINE 1: select iq, * from student, automobile;
   ^
HINT:  Perhaps you meant to reference the column student.id.
postgres=# select automobile.iq, * from student, automobile;
ERROR:  42703: column automobile.iq does not exist
LINE 1: select automobile.iq, * from student, automobile;
   ^

(note that using the table name makes us *not* see a suggestion where
we otherwise would).

The point is that there is a fixed penalty for a wrong user-specified
alias, but all relation RTEs are considered.

 The amount of difference between the names has got to
 matter for the RTE names, just as it does for the column names.

I think it makes sense that it matters by a fixed amount. Besides,
this seems complicated enough already - I don't won't to add more
complexity to worry about equidistant (but still actually valid)
RTE/table/alias names.

It sounds like your concern here is mostly a concern about the
relative distance among multiple matches, as opposed to the absolute
quality of suggestions. The former seems a lot less controversial than
the latter was, though - the user always gets the best match, or the
join pair of best matches, or no match when this new hinting mechanism
is involved.

I attach a new revision. The revision:

* Uses default costs for Levenshtein distance.

* Still charges extra for a non-alias-matching match (although it only
charges a fixed distance of 1 extra). This has regression test
coverage.

* Applies a generic final quality check that enforces a requirement
that a hint have a distance of no greater than 50% of the total string
size. No special treatment of shorter strings is involved anymore.

* Moves almost everything out of scanRTEForColumn() as you outlined
(into a new function, updateFuzzyAttrMatchState(), per your
suggestion).

* Moves dropped column detection into updateFuzzyAttrMatchState(), per
your suggestion.

* Still does the if (rte-rtekind == RTE_JOIN) thing in the existing
function searchRangeTableForCol().

I am quite confident that a suggestion from a join RTE will never be
useful, to either the existing use of searchRangeTableForCol() or this
expanded use, and it makes more sense to me to put it there. In fact,
the existing use of searchRangeTableForCol() is really rather similar
to this, and will give up on the first identical match (which is taken
as evidence that there is a attribute of that name, but isn't visible
at this level of the query). So I have not followed your suggestion
here.

Thoughts?
-- 
Peter Geoghegan
From 81c7b0691e9d03c1bdd99f4b264737306d1bd2cf Mon Sep 17 00:00:00 2001
From: Peter Geoghegan p...@heroku.com
Date: Wed, 12 Nov 2014 15:31:37 -0800
Subject: [PATCH] Levenshtein distance column HINT

Add a new HINT -- a guess as to what column the user might have intended
to reference, to be shown in various contexts where an
ERRCODE_UNDEFINED_COLUMN error is raised.  The user will see this HINT
when he or she fat-fingers a column reference in an ad-hoc SQL query, or
incorrectly pluralizes or fails to pluralize a column reference, or
incorrectly omits or 

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-12-02 Thread Robert Haas
On Tue, Nov 25, 2014 at 7:13 PM, Peter Geoghegan p...@heroku.com wrote:
 Alright, so let me summarize what I think are the next steps in
 working towards getting this patch committed. I should produce a new
 revision which:

 * Uses default costings.

 * Applies a generic final quality check that enforces a distance of no
 greater than 50% of the total string size. (The use of default
 costings removes any reason to continue to do this)

 * Work through Robert's suggestions on other aspects that need work
 [1], most of which I already agreed to.

Sounds good so far.

 What is unclear is whether or not I should continue to charge extra
 for non-matching user supplied alias (and, I think more broadly,
 consider multiple RTEs iff the user did use an alias) - Robert was
 skeptical, but didn't seem to have made his mind up. I still think I
 should cost things based on aliases, and consider multiple RTEs even
 when the user supplied an alias (the penalty should just be a distance
 of 1 and not 3, though, in light of other changes to the
 weighing/costing). If I don't hear anything in the next day or two,
 I'll more or less preserve aliases-related aspects of the patch.

Basically, the case in which I think it's helpful to issue a
suggestion here is when the user has used the table name rather than
the alias name.  I wonder if it's worth checking for that case
specifically, in lieu of what you've done here, and issuing a totally
different hint in that case (HINT: You must refer to this as column
as prime_minister.id rather than cameron.id).

Another idea, which I think I like less well, is to check the
Levenshtein distance between the allowed alias and the entered alias
and, if that's within the half-the-shorter-length threshold, consider
possible matches from that RTE, charge the distance between the
correct alias and the entered alias as a penalty to each potential
column match.

What I think won't do is to look at a situation where the user has
entered automobile.id and suggest that maybe they meant student.iq, or
even student.id. The amount of difference between the names has got to
matter for the RTE names, just as it does for the column names.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-30 Thread Peter Geoghegan
On Tue, Nov 25, 2014 at 4:13 PM, Peter Geoghegan p...@heroku.com wrote:
 If I don't hear anything in the next day or two,
 I'll more or less preserve aliases-related aspects of the patch.

FYI, I didn't go ahead and work on this, because I thought that the
thanksgiving holiday in the US probably kept you from giving feedback.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-20 Thread Robert Haas
On Wed, Nov 19, 2014 at 2:33 PM, Peter Geoghegan p...@heroku.com wrote:
 I don't think that's the case. Other RTEs are penalized for having
 non-matching aliases here.

The point I made did not, as far as I can see, have anything to do
with non-matching aliases; it could arise with just a single RTE.

 In general, I think the cost of a bad suggestion is much lower than
 the benefit of a good one. You seem to be suggesting that they're
 equal. Or that they're equally likely in an organic situation. In my
 estimation, this is not the case at all.

The way I see it, the main cost of a bad suggestion is that it annoys
the user with clutter which they may brand as stupid.  Think about
how much vitriol has been spewed over the years against progress bars
(or estimated completion) times that don't turn out to mirror reality.
Microsoft has gotten more cumulative flack about their inaccurate
progress bars over the years than they would have for dropping an
elevator on a cute baby.  Or think about how annoying it is when a
spell-checker or grammar-checker underlines something you've written
that is, in your own opinion, correctly spelled or grammatical.  Maybe
that kind of thing doesn't annoy you, but it definitely annoys me, and
I think probably a lot of other people.  My general experience is that
people get quite pissed off by bad suggestions from a computer.  At
least in my experience, users' actual level of agitation is often all
out of proportion to what might seem justified, but we are designing
this software for actual users, so their likely emotional reactions
are relevant.

 I'm curious about your thoughts on the compromise of a ramped up
 distance threshold to apply a test for the absolute quality of a
 match. I think that the fact that git gives bad suggestions with terse
 strings tells us a lot, though. Note that unlike git, with terse
 strings we may well have a good deal more equidistant matches, and as
 soon as the number of would-be matches exceeds 2, we actually give no
 matches at all. So that's an additional protection against poor
 matches with terse strings.

I don't know what you mean by a ramped-up distance threshold, exactly.
I think it's good for the distance threshold to be lower for small
strings and higher for large ones.  I think I'm somewhat open to
negotiation on the details, but I think any system that's going to
suggest quantity for tit is going too far.  If the user types
qty when they meant quantity, they probably don't really need the
hint, because they're going to say to themselves wait, I guess I
didn't abbreviate that.  The time when they need the hint is when
they typed quanttiy, because it's quite possible to read a query
with that sort of typo multiple times and not realize that you've made
one.  You're sitting there puzzling over where the quantity column
went, and asking yourselves how you can be mis-remembering the schema,
and saying wait, didn't I just see that column in the \d output ...
and you don't even think to check carefully for a spelling mistake.
The hint may well clue you in to what the real problem is.

In other words, I think there's value in trying to clue somebody in
when they've made a typo, but not when they've made a think-o.  We
won't be able to do the latter accurately enough to make it more
useful than annoying.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-20 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 ... In other words, I think there's value in trying to clue somebody in
 when they've made a typo, but not when they've made a think-o.  We
 won't be able to do the latter accurately enough to make it more
 useful than annoying.

FWIW, I concur with Robert's analysis that wrong suggestions are likely to
be annoying.  We should be erring on the side of not making a suggestion
rather than making one that's a low-probability guess.

I'm not particularly convinced that the f1 - f2 example is a useful
behavior, and I'm downright horrified by the qty - quantity case.
If the hint mechanism thinks the latter two are close enough together
to suggest, it's going to be spewing a whole lot of utterly ridiculous
suggestions.  I'm going to be annoyed way more times than I'm going to
be helped.

The big picture is that this is more or less our first venture into
heuristic suggestions.  I think we should start slow with a very
conservative set of heuristics.  If it's a success maybe we can get more
aggressive over time --- but if we go over the top here, the entire
concept will be discredited in this community for the next ten years.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-20 Thread Peter Geoghegan
On Thu, Nov 20, 2014 at 8:05 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 I'm not particularly convinced that the f1 - f2 example is a useful
 behavior, and I'm downright horrified by the qty - quantity case.
 If the hint mechanism thinks the latter two are close enough together
 to suggest, it's going to be spewing a whole lot of utterly ridiculous
 suggestions.  I'm going to be annoyed way more times than I'm going to
 be helped.

I happen to think that that isn't the case, because the number of
possible suggestions is fairly low anyway, and people don't tend to
make those kind of errors. Robert's examples of ridiculous
suggestions of quantity based on three letter strings other than
qty (e.g. tit) were rather contrived. In fact, most 3 letter
strings will not offer a suggestion. 3 or more Equidistant would-be
matches tend to offer a lot of additional protection against bad
suggestions for these terse strings.

 The big picture is that this is more or less our first venture into
 heuristic suggestions.  I think we should start slow with a very
 conservative set of heuristics.  If it's a success maybe we can get more
 aggressive over time --- but if we go over the top here, the entire
 concept will be discredited in this community for the next ten years.

I certainly see your point here. It's not as if we have an *evolved*
understanding of the usability issues. Besides, as Robert pointed out,
most of the value of this patch is added by simple cases, like a
failure to pluralize or not pluralize, or the omission of an
underscore.

I still think we should charge half for deletion, but I will concede
that it's prudent to apply a more restrictive absolute quality final
test.
-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-20 Thread Peter Geoghegan
On Thu, Nov 20, 2014 at 7:32 AM, Robert Haas robertmh...@gmail.com wrote:
 In general, I think the cost of a bad suggestion is much lower than
 the benefit of a good one. You seem to be suggesting that they're
 equal. Or that they're equally likely in an organic situation. In my
 estimation, this is not the case at all.

 The way I see it, the main cost of a bad suggestion is that it annoys
 the user with clutter which they may brand as stupid.  Think about
 how much vitriol has been spewed over the years against progress bars
 (or estimated completion) times that don't turn out to mirror reality.

Well, you can judge the quality of the suggestion immediately. I
imagined a mechanism that gives a little bit more than the minimum
amount of guidance for things like contractions/abbreviations.

 Microsoft has gotten more cumulative flack about their inaccurate
 progress bars over the years than they would have for dropping an
 elevator on a cute baby.

I haven't used a more recent version of Windows than Windows Vista,
but I'm pretty sure that they kept it up.

 I'm curious about your thoughts on the compromise of a ramped up
 distance threshold to apply a test for the absolute quality of a
 match. I think that the fact that git gives bad suggestions with terse
 strings tells us a lot, though. Note that unlike git, with terse
 strings we may well have a good deal more equidistant matches, and as
 soon as the number of would-be matches exceeds 2, we actually give no
 matches at all. So that's an additional protection against poor
 matches with terse strings.

 I don't know what you mean by a ramped-up distance threshold, exactly.
 I think it's good for the distance threshold to be lower for small
 strings and higher for large ones.  I think I'm somewhat open to
 negotiation on the details, but I think any system that's going to
 suggest quantity for tit is going too far.

I mean the suggestion of raising the cost threshold more gradually,
not as a step function of the number of characters in the string [1]
where it's either over 6 characters and must pass the 50% test, or
isn't and has no absolute quality test. The exact modification I
described will FWIW remove the quantity for qty suggestion, as
well as all the similar suggestions that you found objectionable (like
tit also offering a suggestion of quantity).

If you look at the regression tests, none of the sensible suggestions
are lost (some would be by an across the board 50% absolute quality
threshold, as I previously pointed out [2]), but all the bad ones are.
I attach failed regression test output showing the difference between
the previous expected values, and actual values with that small
modification - it looks like most or all bad cases are now fixed.

 If the user types
 qty when they meant quantity, they probably don't really need the
 hint, because they're going to say to themselves wait, I guess I
 didn't abbreviate that.  The time when they need the hint is when
 they typed quanttiy, because it's quite possible to read a query
 with that sort of typo multiple times and not realize that you've made
 one.

I agree that that's a more important case.

 In other words, I think there's value in trying to clue somebody in
 when they've made a typo, but not when they've made a think-o.  We
 won't be able to do the latter accurately enough to make it more
 useful than annoying.

That's certainly true; I think that we only disagree about the exact
point at which we enter the think-o correction business.

[1] 
http://www.postgresql.org/message-id/CAM3SWZT+7hH29Go6ZuY2OrCS40=6ypvm_nt9njfovp3xwji...@mail.gmail.com
[2] 
http://www.postgresql.org/message-id/cam3swztsgoknht8rk+0eed7spnjg4padmbqqyi0fh9bwcnv...@mail.gmail.com
-- 
Peter Geoghegan


regression.diffs
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-20 Thread Robert Haas
On Thu, Nov 20, 2014 at 1:30 PM, Peter Geoghegan p...@heroku.com wrote:
 I mean the suggestion of raising the cost threshold more gradually,
 not as a step function of the number of characters in the string [1]
 where it's either over 6 characters and must pass the 50% test, or
 isn't and has no absolute quality test. The exact modification I
 described will FWIW remove the quantity for qty suggestion, as
 well as all the similar suggestions that you found objectionable (like
 tit also offering a suggestion of quantity).

 If you look at the regression tests, none of the sensible suggestions
 are lost (some would be by an across the board 50% absolute quality
 threshold, as I previously pointed out [2]), but all the bad ones are.
 I attach failed regression test output showing the difference between
 the previous expected values, and actual values with that small
 modification - it looks like most or all bad cases are now fixed.

That does seem to give better results, but it still seems awfully
complicated.  If we just used Levenshtein with all-default cost
factors and a distance cap equal to Max(strlen(what_user_typed),
strlen(candidate_match), 3), what cases that you think are important
would be harmed?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-20 Thread Peter Geoghegan
On Thu, Nov 20, 2014 at 11:03 AM, Robert Haas robertmh...@gmail.com wrote:
 That does seem to give better results, but it still seems awfully
 complicated.  If we just used Levenshtein with all-default cost
 factors and a distance cap equal to Max(strlen(what_user_typed),
 strlen(candidate_match), 3), what cases that you think are important
 would be harmed?

Well, just by plugging in default Levenshtein cost factors, I can see
the following regression:

*** /home/pg/postgresql/src/test/regress/expected/join.out 2014-11-20
10:17:55.042291912 -0800
--- /home/pg/postgresql/src/test/regress/results/join.out 2014-11-20
11:42:15.670108745 -0800
***
*** 3452,3458 
  ERROR:  column atts.relid does not exist
  LINE 1: select atts.relid::regclass, s.* from pg_stats s join
 ^
- HINT:  Perhaps you meant to reference the column atts.indexrelid.

Within the catalogs, the names of attributes are prefixed as a form of
what you might call internal namespacing. For example, pg_index has
attributes that all begin with ind*. You could easily omit something
like that, while still more or less knowing what you're looking for.

In more concrete terms, this gets no suggestion:

postgres=# select key from pg_index;
ERROR:  42703: column key does not exist
LINE 1: select key from pg_index;
   ^

Only this does:

postgres=# select ikey from pg_index;
ERROR:  42703: column ikey does not exist
LINE 1: select ikey from pg_index;
   ^
HINT:  Perhaps you meant to reference the column pg_index.indkey.
postgres=#

The git people varied their Levenshtein costings for a reason.

I also think that a one size fits all cap will break things. It will
independently break the example above, as well as the more marginal
c1.f2. vs c1.f2 case (okay, maybe that case was exactly on the
threshold, but others won't be).

I don't see that different costings actually saves any complexity.
Similarly, the final cap is quite straightforward. Anything with any
real complexity happens before that.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-20 Thread Robert Haas
On Thu, Nov 20, 2014 at 3:00 PM, Peter Geoghegan p...@heroku.com wrote:
 On Thu, Nov 20, 2014 at 11:03 AM, Robert Haas robertmh...@gmail.com wrote:
 That does seem to give better results, but it still seems awfully
 complicated.  If we just used Levenshtein with all-default cost
 factors and a distance cap equal to Max(strlen(what_user_typed),
 strlen(candidate_match), 3), what cases that you think are important
 would be harmed?

 Well, just by plugging in default Levenshtein cost factors, I can see
 the following regression:

 *** /home/pg/postgresql/src/test/regress/expected/join.out 2014-11-20
 10:17:55.042291912 -0800
 --- /home/pg/postgresql/src/test/regress/results/join.out 2014-11-20
 11:42:15.670108745 -0800
 ***
 *** 3452,3458 
   ERROR:  column atts.relid does not exist
   LINE 1: select atts.relid::regclass, s.* from pg_stats s join
  ^
 - HINT:  Perhaps you meant to reference the column atts.indexrelid.

 Within the catalogs, the names of attributes are prefixed as a form of
 what you might call internal namespacing. For example, pg_index has
 attributes that all begin with ind*. You could easily omit something
 like that, while still more or less knowing what you're looking for.

 In more concrete terms, this gets no suggestion:

 postgres=# select key from pg_index;
 ERROR:  42703: column key does not exist
 LINE 1: select key from pg_index;
^

 Only this does:

 postgres=# select ikey from pg_index;
 ERROR:  42703: column ikey does not exist
 LINE 1: select ikey from pg_index;
^
 HINT:  Perhaps you meant to reference the column pg_index.indkey.
 postgres=#

Seems fine to me.  If you typed relid rather than indexrelid or key
rather than indkey, that's a thinko, not a typo.  ikey for indkey
could plausible be a typo, though you'd have to be having a fairly bad
day at the keyboard.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-20 Thread Peter Geoghegan
On Thu, Nov 20, 2014 at 12:14 PM, Robert Haas robertmh...@gmail.com wrote:
 Seems fine to me.  If you typed relid rather than indexrelid or key
 rather than indkey, that's a thinko, not a typo.  ikey for indkey
 could plausible be a typo, though you'd have to be having a fairly bad
 day at the keyboard.

I can tell that I have no chance of convincing you otherwise. While I
think you're mistaken to go against the precedent set by git, you're
the one with the commit bit, and I think we've already spent enough
time discussing this. So default costings it is.
-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-20 Thread Andres Freund
On 2014-11-20 12:00:51 -0800, Peter Geoghegan wrote:
 In more concrete terms, this gets no suggestion:
 
 postgres=# select key from pg_index;
 ERROR:  42703: column key does not exist
 LINE 1: select key from pg_index;
^

I don't think that's a bad thing. Yes, for a human those look pretty
similar, but it's easy to construct cases where that gives completely
hilarious results.

I think something simplistic like levenshtein, even with modified
distances, is good to catch typos. But not to find terms that are
related in more complex ways.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-20 Thread Tom Lane
Peter Geoghegan p...@heroku.com writes:
 On Thu, Nov 20, 2014 at 11:03 AM, Robert Haas robertmh...@gmail.com wrote:
 That does seem to give better results, but it still seems awfully
 complicated.  If we just used Levenshtein with all-default cost
 factors and a distance cap equal to Max(strlen(what_user_typed),
 strlen(candidate_match), 3), what cases that you think are important
 would be harmed?

 Well, just by plugging in default Levenshtein cost factors, I can see
 the following regression:

 *** /home/pg/postgresql/src/test/regress/expected/join.out 2014-11-20
 10:17:55.042291912 -0800
 --- /home/pg/postgresql/src/test/regress/results/join.out 2014-11-20
 11:42:15.670108745 -0800
 ***
 *** 3452,3458 
   ERROR:  column atts.relid does not exist
   LINE 1: select atts.relid::regclass, s.* from pg_stats s join
  ^
 - HINT:  Perhaps you meant to reference the column atts.indexrelid.


I do not have a problem with deciding that that is not a regression;
in fact, not giving that hint seems like a good conservative behavior
here.  By your logic, we should also be prepared to suggest
supercalifragilisticexpialidocious when the user enters ocious.
It's simply a bridge too far for what is supposed to be a hint for
minor typos.  You sound like you want to turn it into something that
will look up column names for people who are too lazy to even try to
type the right thing.  While I can see the value of such a tool within
certain contexts, firing completed queries at a live SQL engine
is not one of them.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-20 Thread Peter Geoghegan
On Thu, Nov 20, 2014 at 12:26 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I do not have a problem with deciding that that is not a regression;
 in fact, not giving that hint seems like a good conservative behavior
 here.  By your logic, we should also be prepared to suggest
 supercalifragilisticexpialidocious when the user enters ocious.

That's clearly not true. I just want to be a bit more forgiving of
omissions. That clearly isn't my logic, since that isn't a suggestion
that the implementation will give, or would be anywhere close to
giving - my weighing of deletions is only twice that of substitutions
or insertions, not ten times. git does not use Levenshtein default
costings either.

 It's simply a bridge too far for what is supposed to be a hint for
 minor typos.

Minor typos and minor omissions. My example was on the edge of what
would be tolerable under my proposed cost model.

 You sound like you want to turn it into something that
 will look up column names for people who are too lazy to even try to
 type the right thing.  While I can see the value of such a tool within
 certain contexts, firing completed queries at a live SQL engine
 is not one of them.

It's just a hint; a convenience. Users who imagine that it takes away
the need for putting any thought into their SQL queries have bigger
problems.

Anyway, that's all that needs to be said on that, since I've already
given up on a non-default costing. Also, we have default costing, and
we always apply a 50% standard (I see no point in doing otherwise with
default costings).

Robert: Where does that leave us? What about suggestions across RTEs?
Alias costing, etc?
-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-20 Thread Robert Haas
On Thu, Nov 20, 2014 at 3:20 PM, Peter Geoghegan p...@heroku.com wrote:
 On Thu, Nov 20, 2014 at 12:14 PM, Robert Haas robertmh...@gmail.com wrote:
 Seems fine to me.  If you typed relid rather than indexrelid or key
 rather than indkey, that's a thinko, not a typo.  ikey for indkey
 could plausible be a typo, though you'd have to be having a fairly bad
 day at the keyboard.

 I can tell that I have no chance of convincing you otherwise. While I
 think you're mistaken to go against the precedent set by git, you're
 the one with the commit bit, and I think we've already spent enough
 time discussing this. So default costings it is.

I've got a few +1s, too, if you notice.

I'm willing to be outvoted, but not by a majority of one.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-20 Thread Peter Geoghegan
On Thu, Nov 20, 2014 at 12:50 PM, Robert Haas robertmh...@gmail.com wrote:
 I've got a few +1s, too, if you notice.

Then maybe I spoke too soon.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-19 Thread Robert Haas
On Tue, Nov 18, 2014 at 8:03 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Peter Geoghegan p...@heroku.com writes:
 On Tue, Nov 18, 2014 at 3:29 PM, Robert Haas robertmh...@gmail.com wrote:
 On Mon, Nov 17, 2014 at 3:04 PM, Peter Geoghegan p...@heroku.com wrote:
 postgres=# select qty from orderlines ;
 ERROR:  42703: column qty does not exist
 HINT:  Perhaps you meant to reference the column orderlines.quantity.

 I don't buy this example, because it would give you the same hint if
 you told it you wanted to access a column called ant, or uay, or tit.
 And that's clearly ridiculous.  The reason why quantity looks like a
 reasonable suggestion for qty is because it's a conventional
 abbreviation, but an extremely high percentage of comparable cases
 won't be.

 I maintain that omission of part of the correct spelling should be
 weighed less.

 I would say that omission of the first letter should completely disqualify
 suggestions based on this heuristic; but it might make sense to weight
 omissions less after the first letter.

I think we would be well-advised not to start inventing our own
approximate matching algorithm.  Peter's suggestion boils down to a
guess that the default cost parameters for Levenshtein suck, and your
suggestion boils down to a guess that we can fix the problems with
Peter's suggestion by bolting another heuristic on top of it - and
possibly running Levenshtein twice with different sets of cost
parameters.  Ugh.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-19 Thread Peter Geoghegan
On Wed, Nov 19, 2014 at 5:43 AM, Robert Haas robertmh...@gmail.com wrote:
 I think we would be well-advised not to start inventing our own
 approximate matching algorithm.  Peter's suggestion boils down to a
 guess that the default cost parameters for Levenshtein suck, and your
 suggestion boils down to a guess that we can fix the problems with
 Peter's suggestion by bolting another heuristic on top of it - and
 possibly running Levenshtein twice with different sets of cost
 parameters.  Ugh.

I agree.

While I am perfectly comfortable with the fact that we are guessing
here, my guesses are based on what I observed to work well with real
schemas, and simulated errors that I thought were representative of
human error. Obviously it's possible that another scheme will do
better sometimes, including for example a scheme that picks a match
entirely at random. But on average, I think that what I have here will
do better than anything else proposed so far.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-19 Thread Robert Haas
On Wed, Nov 19, 2014 at 12:33 PM, Peter Geoghegan p...@heroku.com wrote:
 On Wed, Nov 19, 2014 at 5:43 AM, Robert Haas robertmh...@gmail.com wrote:
 I think we would be well-advised not to start inventing our own
 approximate matching algorithm.  Peter's suggestion boils down to a
 guess that the default cost parameters for Levenshtein suck, and your
 suggestion boils down to a guess that we can fix the problems with
 Peter's suggestion by bolting another heuristic on top of it - and
 possibly running Levenshtein twice with different sets of cost
 parameters.  Ugh.

 I agree.

 While I am perfectly comfortable with the fact that we are guessing
 here, my guesses are based on what I observed to work well with real
 schemas, and simulated errors that I thought were representative of
 human error. Obviously it's possible that another scheme will do
 better sometimes, including for example a scheme that picks a match
 entirely at random. But on average, I think that what I have here will
 do better than anything else proposed so far.

If you agree, then I'm not being clear enough.  I don't think think
that tinkering with the Levenshtein cost factors is a good idea, and I
think it's unhelpful to suggest something when the suggestion and the
original word differ by more than 50% of the number characters in the
shorter word.  Suggesting col for oid or x for xmax, as crops
up in the regression tests with this patch applied, shows the folly of
this: the user didn't mean the other named column; rather, the user
was confused about whether a particular system column existed for that
table.

If we had a large database of examples showing what the user typed and
what they intended, we could try different algorithms against it and
see which one performs best with fewest false positives.  But if we
don't have that, we should do things that are like the things that
other people have done before.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-19 Thread Peter Geoghegan
On Wed, Nov 19, 2014 at 9:52 AM, Robert Haas robertmh...@gmail.com wrote:
 If you agree, then I'm not being clear enough.  I don't think think
 that tinkering with the Levenshtein cost factors is a good idea, and I
 think it's unhelpful to suggest something when the suggestion and the
 original word differ by more than 50% of the number characters in the
 shorter word.

I agree - except for very short strings, where there is insufficient
information to go on.

I was talking about the difficulty of bolting something on top of
Levenshtein distance that looked at the first character. That would
not need two costings, but would require an encoding aware matching of
the first code point.

 Suggesting col for oid or x for xmax, as crops
 up in the regression tests with this patch applied, shows the folly of
 this: the user didn't mean the other named column; rather, the user
 was confused about whether a particular system column existed for that
 table.

Those are all very terse strings. What you're overlooking is what is
broken by using straight Levenshtein distance, which includes things
in the regression test that are reasonable and helpful. As I mentioned
before, requiring a greater than 50% of total string size distance
breaks this, just within the regression tests:


  ERROR:  column f1 does not exist
  LINE 1: select f1 from c1;
 ^
- HINT:  Perhaps you meant to reference the column c1.f2.


And:


  ERROR:  column atts.relid does not exist
  LINE 1: select atts.relid::regclass, s.* from pg_stats s join
 ^
- HINT:  Perhaps you meant to reference the column atts.indexrelid.


Those are really useful suggestions! And, they're much more
representative of real user error.

The downside of weighing deletion less than substitution and insertion
is much smaller than the upside. It's worth it. The downside is only
that the user gets to see the best suggestion that isn't all that good
in an absolute sense (which we have a much harder time concluding
using simple tests for short misspellings).

 If we had a large database of examples showing what the user typed and
 what they intended, we could try different algorithms against it and
 see which one performs best with fewest false positives.  But if we
 don't have that, we should do things that are like the things that
 other people have done before.

That seems totally impractical. No one has that kind of data that I'm aware of.

How about git as a kind of precedent? It is not at all conservative
about showing *some* suggestion:


$ git aa
git: 'aa' is not a git command. See 'git --help'.

Did you mean this?
am

$ git d
git: 'd' is not a git command. See 'git --help'.

Did you mean one of these?
diff
add

$ git ddd
git: 'ddd' is not a git command. See 'git --help'.

Did you mean this?
add


And why wouldn't git be? As far as its concerned, you can only have
meant one of those small number of things. Similarly, with the patch,
the number of things we can pick from is fairly limited at this stage,
since we are actually fairly far along with parse analysis.

Now, this won't give a suggestion:


$ git 
git: '' is not a git command. See 'git --help'.


So it looks like git similarly weighs deletion less than
insertion/substitution. Are its suggestions any less ridiculous than
your examples of questionable hints from the modified regression test
expected output? This is just a guidance mechanism, and at worst we'll
show the best match (on the mechanisms own terms, which isn't too
bad).

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-19 Thread Peter Geoghegan
On Wed, Nov 19, 2014 at 10:22 AM, Peter Geoghegan p...@heroku.com wrote:
 Those are all very terse strings. What you're overlooking is what is
 broken by using straight Levenshtein distance, which includes things
 in the regression test that are reasonable and helpful. As I mentioned
 before, requiring a greater than 50% of total string size distance
 breaks this, just within the regression tests:

Maybe you'd prefer if there was a more gradual ramp-up to requiring a
distance of no greater than 50% of the string size (normalized to take
account of my non-default costings). Right now it's a step function of
the number of characters in the string - there is no absolute
quality requirement for strings of 6 or fewer requirements.
Otherwise, there is the 50% distance absolute quality test (the test
that you want to be applied generally). I think that would be better,
without being much more complicated.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-19 Thread Peter Geoghegan
On Wed, Nov 19, 2014 at 10:33 AM, Peter Geoghegan p...@heroku.com wrote:
 there is no absolute
 quality requirement for strings of 6 or fewer requirements.

I meant 6 or fewer *characters*, obviously.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-19 Thread Peter Geoghegan
On Wed, Nov 19, 2014 at 10:33 AM, Peter Geoghegan p...@heroku.com wrote:
 Maybe you'd prefer if there was a more gradual ramp-up to requiring a
 distance of no greater than 50% of the string size (normalized to take
 account of my non-default costings)

I made this modification:

diff --git a/src/backend/parser/parse_relation.c
b/src/backend/parser/parse_relation.c
index 40c69d7..cca075f 100644
--- a/src/backend/parser/parse_relation.c
+++ b/src/backend/parser/parse_relation.c
@@ -929,7 +929,8 @@ searchRangeTableForCol(ParseState *pstate, const
char *alias, char *colname,
 * seen when 6 deletions are required against actual attribute
name, or 3
 * insertions/substitutions.
 */
-   if (state-distance  6  state-distance  strlen(colname) / 2)
+   if ((state-distance  3  state-distance  strlen(colname)) ||
+   (state-distance  6  state-distance  strlen(colname) / 2))
{
state-rsecond = state-rfirst = NULL;
state-second = state-first = InvalidAttrNumber;


When I run the regression tests now, then all the cases that you found
objectionable in the regression tests' previous expected output
disappear, while all the cases I think are useful that were previously
removed by applying a broad 50% standard remain. While I'm not 100%
sure that this exact formulation is the best one, I think that we can
reach a compromise on this point, that allows the costing to remain
the same without offering particularly bad suggestions for short
strings.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-19 Thread Robert Haas
On Wed, Nov 19, 2014 at 1:22 PM, Peter Geoghegan p...@heroku.com wrote:
 On Wed, Nov 19, 2014 at 9:52 AM, Robert Haas robertmh...@gmail.com wrote:
 If you agree, then I'm not being clear enough.  I don't think think
 that tinkering with the Levenshtein cost factors is a good idea, and I
 think it's unhelpful to suggest something when the suggestion and the
 original word differ by more than 50% of the number characters in the
 shorter word.

 I agree - except for very short strings, where there is insufficient
 information to go on.

That's precisely the time I think it's *most* important.  In a very
long string, the threshold should be LESS than 50%.  My original
proposal was no more than 3 characters of difference, but in any
event not more than half the length of the shorter string.

 Suggesting col for oid or x for xmax, as crops
 up in the regression tests with this patch applied, shows the folly of
 this: the user didn't mean the other named column; rather, the user
 was confused about whether a particular system column existed for that
 table.

 Those are all very terse strings. What you're overlooking is what is
 broken by using straight Levenshtein distance, which includes things
 in the regression test that are reasonable and helpful. As I mentioned
 before, requiring a greater than 50% of total string size distance
 breaks this, just within the regression tests:

 
   ERROR:  column f1 does not exist
   LINE 1: select f1 from c1;
  ^
 - HINT:  Perhaps you meant to reference the column c1.f2.
 

That's exactly 50%, not more than 50%.

(I'm also on the fence about whether the hint is actually helpful in
that case, but the rule I proposed wouldn't prohibit it.)

 And:

 
   ERROR:  column atts.relid does not exist
   LINE 1: select atts.relid::regclass, s.* from pg_stats s join
  ^
 - HINT:  Perhaps you meant to reference the column atts.indexrelid.
 

 Those are really useful suggestions! And, they're much more
 representative of real user error.

That one's right at 50% too, but it's certainly more than 3 characters
of difference.  I think it's going to be pretty hard to emit a
suggestion in that case but not in a whole lot of cases that don't
make any sense.

 How about git as a kind of precedent? It is not at all conservative
 about showing *some* suggestion:

 
 $ git aa
 git: 'aa' is not a git command. See 'git --help'.

 Did you mean this?
 am

 $ git d
 git: 'd' is not a git command. See 'git --help'.

 Did you mean one of these?
 diff
 add

 $ git ddd
 git: 'ddd' is not a git command. See 'git --help'.

 Did you mean this?
 add
 

 And why wouldn't git be? As far as its concerned, you can only have
 meant one of those small number of things. Similarly, with the patch,
 the number of things we can pick from is fairly limited at this stage,
 since we are actually fairly far along with parse analysis.

I went and found the actual code git uses for this.  It's here:

https://github.com/git/git/blob/d29e9c89dbbf0876145dc88615b99308cab5f187/help.c

And the underlying Levenshtein implementation is here:

https://github.com/git/git/blob/398dd4bd039680ba98497fbedffa415a43583c16/levenshtein.c

Apparently what they're doing is charging 0 for a transposition (which
we don't have as a separate concept), 2 for a substitution, 1 for an
insertion, and 3 for a deletion, with the constraint that anything
with a total distance of more than 6 isn't considered.  And that does
overall seem to give pretty good suggestions.  However, an interesting
point about the git algorithm is that it's not hard to make it do
stupid things on short strings:

[rhaas pgsql]$ git xy
git: 'xy' is not a git command. See 'git --help'.

Did you mean one of these?
am
gc
mv
rm

Maybe they should adopt my idea of a lower cutoff for short strings.  :-)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-19 Thread Peter Geoghegan
On Wed, Nov 19, 2014 at 11:13 AM, Robert Haas robertmh...@gmail.com wrote:
 Apparently what they're doing is charging 0 for a transposition (which
 we don't have as a separate concept), 2 for a substitution, 1 for an
 insertion, and 3 for a deletion, with the constraint that anything
 with a total distance of more than 6 isn't considered.  And that does
 overall seem to give pretty good suggestions.

The git people know that no git command is longer than 4 or 5
characters. That doesn't apply to us. I certainly would not like to
have an absolute distance test of n characters.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-19 Thread Peter Geoghegan
On Wed, Nov 19, 2014 at 11:13 AM, Robert Haas robertmh...@gmail.com wrote:
 That's precisely the time I think it's *most* important.  In a very
 long string, the threshold should be LESS than 50%.  My original
 proposal was no more than 3 characters of difference, but in any
 event not more than half the length of the shorter string.

We can only hint based on the information given by the user. If they
give a lot of badly matching information, we have something to go on.

 That one's right at 50% too, but it's certainly more than 3 characters
 of difference.  I think it's going to be pretty hard to emit a
 suggestion in that case but not in a whole lot of cases that don't
 make any sense.

I don't think that's the case. Other RTEs are penalized for having
non-matching aliases here.

In general, I think the cost of a bad suggestion is much lower than
the benefit of a good one. You seem to be suggesting that they're
equal. Or that they're equally likely in an organic situation. In my
estimation, this is not the case at all.

I'm curious about your thoughts on the compromise of a ramped up
distance threshold to apply a test for the absolute quality of a
match. I think that the fact that git gives bad suggestions with terse
strings tells us a lot, though. Note that unlike git, with terse
strings we may well have a good deal more equidistant matches, and as
soon as the number of would-be matches exceeds 2, we actually give no
matches at all. So that's an additional protection against poor
matches with terse strings.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-19 Thread Alvaro Herrera
Robert Haas wrote:

 And the underlying Levenshtein implementation is here:
 
 https://github.com/git/git/blob/398dd4bd039680ba98497fbedffa415a43583c16/levenshtein.c
 
 Apparently what they're doing is charging 0 for a transposition (which
 we don't have as a separate concept), 2 for a substitution, 1 for an
 insertion, and 3 for a deletion, with the constraint that anything
 with a total distance of more than 6 isn't considered.

0 for a transposition, wow. I suggested adding transpositions but there
was no support for that idea.  I suggested it because I thikn it's the
most common form of typo, and charging 2 for a deletion plus 1 for an
insertion makes a single transposition mistaek count as 3, which seems
wrong -- particularly seeing the git precedent.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-19 Thread Peter Geoghegan
On Wed, Nov 19, 2014 at 11:34 AM, Alvaro Herrera
alvhe...@2ndquadrant.com wrote:
 0 for a transposition, wow.

Again, they're optimizing for short strings (git commands) only. There
just isn't that many transposition errors possible with a 4 character
string.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-19 Thread Alvaro Herrera
Peter Geoghegan wrote:
 On Wed, Nov 19, 2014 at 11:34 AM, Alvaro Herrera
 alvhe...@2ndquadrant.com wrote:
  0 for a transposition, wow.
 
 Again, they're optimizing for short strings (git commands) only. There
 just isn't that many transposition errors possible with a 4 character
 string.

If there's logic in your statement, I can't see it.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-19 Thread Peter Geoghegan
On Wed, Nov 19, 2014 at 11:54 AM, Alvaro Herrera
alvhe...@2ndquadrant.com wrote:
 Peter Geoghegan wrote:
 On Wed, Nov 19, 2014 at 11:34 AM, Alvaro Herrera
 alvhe...@2ndquadrant.com wrote:
  0 for a transposition, wow.

 Again, they're optimizing for short strings (git commands) only. There
 just isn't that many transposition errors possible with a 4 character
 string.

 If there's logic in your statement, I can't see it.

The point is that transposition errors should not have no cost. If git
did not have an absolute quality test of a distance of 6, which they
can only have because all git commands are terse, then you could
construct a counter example.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-19 Thread Alvaro Herrera
Peter Geoghegan wrote:
 On Wed, Nov 19, 2014 at 11:54 AM, Alvaro Herrera
 alvhe...@2ndquadrant.com wrote:
  Peter Geoghegan wrote:
  On Wed, Nov 19, 2014 at 11:34 AM, Alvaro Herrera
  alvhe...@2ndquadrant.com wrote:
   0 for a transposition, wow.
 
  Again, they're optimizing for short strings (git commands) only. There
  just isn't that many transposition errors possible with a 4 character
  string.
 
  If there's logic in your statement, I can't see it.
 
 The point is that transposition errors should not have no cost. If git
 did not have an absolute quality test of a distance of 6, which they
 can only have because all git commands are terse, then you could
 construct a counter example.

Okay.  My point is just that whatever the string length, I think we'd do
well to regard transpositions as cheap in terms of error cost.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-18 Thread Robert Haas
On Mon, Nov 17, 2014 at 3:04 PM, Peter Geoghegan p...@heroku.com wrote:
 postgres=# select qty from orderlines ;
 ERROR:  42703: column qty does not exist
 LINE 1: select qty from orderlines ;
^
 HINT:  Perhaps you meant to reference the column orderlines.quantity.
 

I don't buy this example, because it would give you the same hint if
you told it you wanted to access a column called ant, or uay, or tit.
And that's clearly ridiculous.  The reason why quantity looks like a
reasonable suggestion for qty is because it's a conventional
abbreviation, but an extremely high percentage of comparable cases
won't be.

 +/*
 + * Charge extra (for inexact matches only) when an alias was
 + * specified that differs from what might have been used to
 + * correctly qualify this RTE's closest column
 + */
 +if (wrongalias)
 +rtestate.distance += 3;

 I don't understand what situation this is catering to.  Can you
 explain?  It seems to account for a good deal of complexity.

 Two cases:

 1. Distinguishing between the case where there was an exact match to a
 column that isn't visible (i.e. the existing reason for
 errorMissingColumn() to call here), and the case where there is a
 visible column, but our alias was the wrong one. I guess that could
 live in errorMissingColumn(), but overall it's more convenient to do
 it here, so that errorMissingColumn() handles things almost uniformly
 and doesn't really have to care.

 2. For non-exact (fuzzy) matches, it seems more useful to give one
 match rather than two when the user gave an alias that matches one
 particular RTE. Consider this:

 
 postgres=# select ordersid from orders o join orderlines ol on
 o.orderid = ol.orderid;
 ERROR:  42703: column ordersid does not exist
 LINE 1: select ordersid from orders o join orderlines ol on o.orderi...
^
 HINT:  Perhaps you meant to reference the column o.orderid or the
 column ol.orderid.
 LOCATION:  errorMissingColumn, parse_relation.c:3166

 postgres=# select ol.ordersid from orders o join orderlines ol on
 o.orderid = ol.orderid;
 ERROR:  42703: column ol.ordersid does not exist
 LINE 1: select ol.ordersid from orders o join orderlines ol on o.ord...
^
 HINT:  Perhaps you meant to reference the column ol.orderid.
 LOCATION:  errorMissingColumn, parse_relation.c:3147
 

I guess I'm confused at a broader level.  If the alias is wrong, why
are we considering names in this RTE *at all*?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-18 Thread Peter Geoghegan
On Tue, Nov 18, 2014 at 3:29 PM, Robert Haas robertmh...@gmail.com wrote:
 On Mon, Nov 17, 2014 at 3:04 PM, Peter Geoghegan p...@heroku.com wrote:
 postgres=# select qty from orderlines ;
 ERROR:  42703: column qty does not exist
 LINE 1: select qty from orderlines ;
^
 HINT:  Perhaps you meant to reference the column orderlines.quantity.
 

 I don't buy this example, because it would give you the same hint if
 you told it you wanted to access a column called ant, or uay, or tit.
 And that's clearly ridiculous.  The reason why quantity looks like a
 reasonable suggestion for qty is because it's a conventional
 abbreviation, but an extremely high percentage of comparable cases
 won't be.

Is that so terrible? Yes, if those *exact* strings are tried, that'll
happen. But the vast majority of 3 letter strings will not do that
(including many 3 letter strings that include one of the letters 'q',
't' and 'y', such as qqq, ttt, and yyy). Why, in practice, would
someone even attempt those strings? I'm worried about Murphy, not
Machiavelli. That seems like a pretty important distinction here.

I maintain that omission of part of the correct spelling should be
weighed less. I am optimizing for the case where the user has a rough
idea of the structure and spelling of things - if they're typing in
random strings, or totally distinct synonyms, there is little we can
do about that. As I said, there will always be the most marginal case
that still gets a suggestion. I see no reason to hurt the common case
where we help in order to save the user from seeing a ridiculous
suggestion. I have a final test for the absolute quality of a
suggestion, but I think we could easily be too conservative about
that. At worst, our ridiculous suggestion makes apparent that the
user's incorrect spelling was itself ridiculous. With larger strings,
we can afford to be more conservative, and we are, because we have
more information to go on. Terse column names are not uncommon,
though.

 +/*
 + * Charge extra (for inexact matches only) when an alias was
 + * specified that differs from what might have been used to
 + * correctly qualify this RTE's closest column
 + */
 +if (wrongalias)
 +rtestate.distance += 3;

 I don't understand what situation this is catering to.  Can you
 explain?  It seems to account for a good deal of complexity.

 I guess I'm confused at a broader level.  If the alias is wrong, why
 are we considering names in this RTE *at all*?

Because it's a common mistake when writing ad-hoc queries. People may
forget which exact table their column comes from. We certainly want to
weigh the fact that an alias was specified, but it shouldn't totally
limit our guess to that RTE. If nothing else, the fact that there was
a much closer match from another RTE ought to result in forcing there
to be no suggestion (due to there being too many equally good
suggestions). That's because, as I said, an *absolute* test for the
quality of a match is problematic (which, again, is why I err on the
side of letting the final, absolute quality test not limit
suggestions, particularly with short strings).

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-18 Thread Tom Lane
Peter Geoghegan p...@heroku.com writes:
 On Tue, Nov 18, 2014 at 3:29 PM, Robert Haas robertmh...@gmail.com wrote:
 On Mon, Nov 17, 2014 at 3:04 PM, Peter Geoghegan p...@heroku.com wrote:
 postgres=# select qty from orderlines ;
 ERROR:  42703: column qty does not exist
 HINT:  Perhaps you meant to reference the column orderlines.quantity.

 I don't buy this example, because it would give you the same hint if
 you told it you wanted to access a column called ant, or uay, or tit.
 And that's clearly ridiculous.  The reason why quantity looks like a
 reasonable suggestion for qty is because it's a conventional
 abbreviation, but an extremely high percentage of comparable cases
 won't be.

 I maintain that omission of part of the correct spelling should be
 weighed less.

I would say that omission of the first letter should completely disqualify
suggestions based on this heuristic; but it might make sense to weight
omissions less after the first letter.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-17 Thread Robert Haas
On Sat, Nov 15, 2014 at 7:36 PM, Peter Geoghegan p...@heroku.com wrote:
 Attached patch incorporates this feedback.

 The only user-visible difference between this revision and the
 previous revision is that it's quite possible for two suggestion to
 originate from the same RTE (there is exactly one change in the
 regression test's expected output as compared to the last revision for
 this reason. The regression tests are otherwise unchanged). It's still
 not possible to see more than 2 suggestions under any circumstances,
 no matter where they might have originated from, which I think is
 appropriate -- we continue to not present any HINT in the event of 3
 or more equidistant matches.

 I think that the restructuring required to pass around a state
 variable has resulted in somewhat clearer code.

Cool!

I'm grumpy about the distanceName() function.  That seems too generic.
If we're going to keep this as it is, I suggest something like
computeRTEColumnDistance().  But see below.

On a related note, I'm also grumpy about this comment:

+/* Charge half as much per deletion as per insertion or per substitution */
+return varstr_levenshtein_less_equal(actual, len, match, match_len,
+ 2, 1, 2, max);

The purpose of a code comment is to articulate WHY we did something,
rather than simply to restate what the code quite obviously does.  I
haven't heard a compelling argument for why this should be 2, 1, 2
rather than the default 1, 1, 1; and I'm inclined to do the latter
unless you can make some very good argument for this combination of
weights.  And if you can make such an argument, then there should be
comments so that the next person to come along and look at this code
doesn't go, huh, that's whacky, and change it.

+ int location, FuzzyAttrMatchState *rtestate)

I suggest calling this fuzzystate rather than rtestate; it's not
the state of the RTE, but the state of the fuzzy matching.

Within the scanRTEForColumn block, we have a rather large chunk of
code protected by if (rtestate), which contains the only call to
distanceName().  I suggest that we move all of this logic to a
separate, static function, and merge distanceName into it.  I also
suggest testing against NULL explicitly instead of implicitly.  So
this block of code would end up as something like:

if (fuzzystate != NULL)
updateFuzzyAttrMatchState(rte, attcolname, colname, fuzzystate);

In searchRangeTableForCol, I'm fairly certain that you've changed the
behavior by adding a check for if (rte-rtekind == RTE_JOIN) before
the call to scanRTEForColumn().  Why not instead put this check into
updateFuzzyAttrMatchState?  Then you can be sure you're not changing
the behavior in any other case.

On a similar note, I think the dropped-column test should happen early
as well, probably again in updateFuzzyAttrMatchState().  There's
little point in adding a suggestion only to throw it away again.

+/*
+ * Charge extra (for inexact matches only) when an alias was
+ * specified that differs from what might have been used to
+ * correctly qualify this RTE's closest column
+ */
+if (wrongalias)
+rtestate.distance += 3;

I don't understand what situation this is catering to.  Can you
explain?  It seems to account for a good deal of complexity.

 ERROR:  column oid does not exist
 LINE 1: select oid  0, * from altwithoid;
^
+HINT:  Perhaps you meant to reference the column altwithoid.col.

That seems like a stretch.  I think I suggested before using a
distance threshold of at most 3 or half the word length, whichever is
less.  For a three-letter column name that means not suggesting
anything if more than one character is different.  What you
implemented here is close to that, yet somehow we've got a suggestion
slipping through that has 2 out of 3 characters different.  I'm not
quite sure I see how that's getting through, but I think it shouldn't.

 ERROR:  column fullname.text does not exist
 LINE 1: select fullname.text from fullname;
^
+HINT:  Perhaps you meant to reference the column fullname.last.

Same problem, only worse!  They've only got one letter of four in common.

 ERROR:  column xmin does not exist
 LINE 1: select xmin, * from fooview;
^
+HINT:  Perhaps you meant to reference the column fooview.x.

Basically the same problem again.  I think the distance threshold in
this case should be half the shorter column name, i.e. 0.

Your new test cases include no negative test cases; that is, cases
where the machinery declines to suggest a hint because of, say, 3
equally good possibilities.  They probably should have something like
that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-17 Thread Peter Geoghegan
On Mon, Nov 17, 2014 at 10:15 AM, Robert Haas robertmh...@gmail.com wrote:
 I'm grumpy about the distanceName() function.  That seems too generic.
 If we're going to keep this as it is, I suggest something like
 computeRTEColumnDistance().  But see below.

Fair point.

 On a related note, I'm also grumpy about this comment:

 +/* Charge half as much per deletion as per insertion or per substitution 
 */
 +return varstr_levenshtein_less_equal(actual, len, match, match_len,
 + 2, 1, 2, max);

 The purpose of a code comment is to articulate WHY we did something,
 rather than simply to restate what the code quite obviously does.  I
 haven't heard a compelling argument for why this should be 2, 1, 2
 rather than the default 1, 1, 1; and I'm inclined to do the latter
 unless you can make some very good argument for this combination of
 weights.  And if you can make such an argument, then there should be
 comments so that the next person to come along and look at this code
 doesn't go, huh, that's whacky, and change it.

Okay. I agree that that deserves a comment. The actual argument for
this formulation is that it just seems to work better that way. For
example:


postgres=# \d orderlines
 Table public.orderlines
   Column|   Type   | Modifiers
-+--+---
 orderlineid | integer  | not null
 orderid | integer  | not null
 prod_id | integer  | not null
 quantity| smallint | not null
 orderdate   | date | not null

postgres=# select qty from orderlines ;
ERROR:  42703: column qty does not exist
LINE 1: select qty from orderlines ;
   ^
HINT:  Perhaps you meant to reference the column orderlines.quantity.


The point is that the fact that the user supplied qty string has so
many fewer characters than what was obviously intended - quantity -
deserves to be weighed less. If you change the costing to weigh
character deletion as being equal to substitution/addition, this
example breaks. I also think it's pretty common to have noise words in
every attribute (e.g. every column in the orderlines table matches
orderlines_*), which might otherwise mess things up by overcharging
for deletion. Having extra characters in the correctly spelled column
name seems legitimately less significant to me.

Or, in other words: having actual characters from the misspelling
match the correct spelling (and having actual characters given not
fail to match) is most important. What was given by the user is more
important than what was not given but should have been, which is not
generally true for uses of Levenshtein distance. I reached this
conclusion through trying out the patch with a couple of real schemas,
and seeing what works best.

It's hard to express that idea tersely, in a comment, but I guess I'll try.

 + int location, FuzzyAttrMatchState *rtestate)

 I suggest calling this fuzzystate rather than rtestate; it's not
 the state of the RTE, but the state of the fuzzy matching.

The idea here was to differentiate this state from the overall range
table state (in general, FuzzyAttrMatchState may be one or the other).
But okay.

 Within the scanRTEForColumn block, we have a rather large chunk of
 code protected by if (rtestate), which contains the only call to
 distanceName().  I suggest that we move all of this logic to a
 separate, static function, and merge distanceName into it.  I also
 suggest testing against NULL explicitly instead of implicitly.  So
 this block of code would end up as something like:

 if (fuzzystate != NULL)
 updateFuzzyAttrMatchState(rte, attcolname, colname, fuzzystate);

Okay.

 In searchRangeTableForCol, I'm fairly certain that you've changed the
 behavior by adding a check for if (rte-rtekind == RTE_JOIN) before
 the call to scanRTEForColumn().  Why not instead put this check into
 updateFuzzyAttrMatchState?  Then you can be sure you're not changing
 the behavior in any other case.

I thought that I had avoided changing things (beyond what was
advertised as changed in relation to this most recent revision)
because I also changed things WRT multiple matches per RTE. It's
fuzzy. Anyway, yeah, I could do it there instead.

 On a similar note, I think the dropped-column test should happen early
 as well, probably again in updateFuzzyAttrMatchState().  There's
 little point in adding a suggestion only to throw it away again.

Agreed.

 +/*
 + * Charge extra (for inexact matches only) when an alias was
 + * specified that differs from what might have been used to
 + * correctly qualify this RTE's closest column
 + */
 +if (wrongalias)
 +rtestate.distance += 3;

 I don't understand what situation this is catering to.  Can you
 explain?  It seems to account for a good deal of complexity.

Two cases:

1. Distinguishing between the case where there was an exact match to a
column that isn't visible 

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-15 Thread Peter Geoghegan
On Wed, Nov 12, 2014 at 12:59 PM, Robert Haas robertmh...@gmail.com wrote:
 On that topic, I think there's unanimous consensus against the design
 where equally-distant matches are treated differently based on whether
 they are in the same RTE or different RTEs.  I think you need to
 change that if you want to get anywhere with this.  On a related note,
 the use of the additional parameter AttrNumber closest[2] to
 searchRangeTableForCol() and of the additional parameters AttrNumber
 *matchedatt and int *distance to scanRTEForColumn() is less than
 self-documenting.  I suggest creating a structure called something
 like FuzzyAttrMatchState and passing a pointer to it down to both
 functions.

Attached patch incorporates this feedback.

The only user-visible difference between this revision and the
previous revision is that it's quite possible for two suggestion to
originate from the same RTE (there is exactly one change in the
regression test's expected output as compared to the last revision for
this reason. The regression tests are otherwise unchanged). It's still
not possible to see more than 2 suggestions under any circumstances,
no matter where they might have originated from, which I think is
appropriate -- we continue to not present any HINT in the event of 3
or more equidistant matches.

I think that the restructuring required to pass around a state
variable has resulted in somewhat clearer code.

-- 
Peter Geoghegan
From 0aef5253f10ebb1ee5bbcc73782eff1352c7ab84 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan p...@heroku.com
Date: Wed, 12 Nov 2014 15:31:37 -0800
Subject: [PATCH] Levenshtein distance column HINT

Add a new HINT -- a guess as to what column the user might have intended
to reference, to be shown in various contexts where an
ERRCODE_UNDEFINED_COLUMN error is raised.  The user will see this HINT
when he or she fat-fingers a column reference in an ad-hoc SQL query, or
incorrectly pluralizes or fails to pluralize a column reference, or
incorrectly omits or includes an underscore or other punctuation
character.

The HINT suggests a column in the range table with the lowest
Levenshtein distance, or the tied-for-best pair of matching columns in
the event of there being exactly two equally likely candidates (these
may come from multiple RTEs, or the same RTE).  Limiting to two the
number of cases where multiple equally likely suggestions are all
offered at once (i.e.  giving no hint when the number of equally likely
candidates exceeds two) is a measure against suggestions that are of low
quality in an absolute sense.

A further, final measure is taken against suggestions that are of low
absolute quality:  If the distance exceeds a normalized distance
threshold, no suggestion is given.
---
 src/backend/parser/parse_expr.c   |   9 +-
 src/backend/parser/parse_func.c   |   2 +-
 src/backend/parser/parse_relation.c   | 345 +++---
 src/backend/utils/adt/levenshtein.c   |   9 +
 src/include/parser/parse_relation.h   |  20 +-
 src/test/regress/expected/alter_table.out |   8 +
 src/test/regress/expected/join.out|  39 
 src/test/regress/expected/plpgsql.out |   1 +
 src/test/regress/expected/rowtypes.out|   1 +
 src/test/regress/expected/rules.out   |   1 +
 src/test/regress/expected/without_oid.out |   2 +
 src/test/regress/sql/join.sql |  24 +++
 12 files changed, 421 insertions(+), 40 deletions(-)

diff --git a/src/backend/parser/parse_expr.c b/src/backend/parser/parse_expr.c
index 4a8aaf6..a77a3a0 100644
--- a/src/backend/parser/parse_expr.c
+++ b/src/backend/parser/parse_expr.c
@@ -621,7 +621,8 @@ transformColumnRef(ParseState *pstate, ColumnRef *cref)
 colname = strVal(field2);
 
 /* Try to identify as a column of the RTE */
-node = scanRTEForColumn(pstate, rte, colname, cref-location);
+node = scanRTEForColumn(pstate, rte, colname, cref-location,
+		NULL);
 if (node == NULL)
 {
 	/* Try it as a function call on the whole row */
@@ -666,7 +667,8 @@ transformColumnRef(ParseState *pstate, ColumnRef *cref)
 colname = strVal(field3);
 
 /* Try to identify as a column of the RTE */
-node = scanRTEForColumn(pstate, rte, colname, cref-location);
+node = scanRTEForColumn(pstate, rte, colname, cref-location,
+		NULL);
 if (node == NULL)
 {
 	/* Try it as a function call on the whole row */
@@ -724,7 +726,8 @@ transformColumnRef(ParseState *pstate, ColumnRef *cref)
 colname = strVal(field4);
 
 /* Try to identify as a column of the RTE */
-node = scanRTEForColumn(pstate, rte, colname, cref-location);
+node = scanRTEForColumn(pstate, rte, colname, cref-location,
+		NULL);
 if (node == NULL)
 {
 	/* Try it as a function call on the whole row */
diff --git a/src/backend/parser/parse_func.c b/src/backend/parser/parse_func.c
index 9ebd3fd..472e15e 100644
--- a/src/backend/parser/parse_func.c
+++ 

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-13 Thread Robert Haas
On Wed, Nov 12, 2014 at 10:42 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 On Tue, Nov 11, 2014 at 3:52 PM, Peter Geoghegan p...@heroku.com wrote:
 I'm pretty puzzled by this. Other than our agree to disagree and
 defer to committer position on the question of whether or not more
 than one suggestion can come from a single RTE, which you were fine
 with before [1], I have only restored the core/contrib separation to a
 state recently suggested by Robert as the best and simplest all around
 [2].
 Did I miss something else?
 My point is: I am not sure I can be defined as a reviewer of this
 patch or take any credit in this patch review knowing that the latest
 version submitted is a simple rebase of the version I did my first
 review on. Hence, code speaking, this patch is in the same state as
 when it has been firstly submitted.

Of course you can.  Time spent reviewing is time spent reviewing,
whether it results in changes to the patch or not.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-13 Thread Peter Geoghegan
On Thu, Nov 13, 2014 at 8:57 AM, Robert Haas robertmh...@gmail.com wrote:
 My point is: I am not sure I can be defined as a reviewer of this
 patch or take any credit in this patch review knowing that the latest
 version submitted is a simple rebase of the version I did my first
 review on. Hence, code speaking, this patch is in the same state as
 when it has been firstly submitted.

 Of course you can.  Time spent reviewing is time spent reviewing,
 whether it results in changes to the patch or not.


My thoughts exactly. I thought Michael did a good job, even if I
didn't agree with everything he said.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-13 Thread Robert Haas
On Wed, Nov 12, 2014 at 8:00 PM, Peter Geoghegan p...@heroku.com wrote:
 On Wed, Nov 12, 2014 at 4:54 PM, Peter Geoghegan p...@heroku.com wrote:
 Attached patch moves the Levenshtein distance implementation into core.

 Oops. Somehow managed to send a *.patch.swp file.  :-)

 Here is the actual patch.

Committed.  I changed varstr_leven() to varstr_levenshtein() because
abbrvs cn mk the code hrd to undstnd.  And to grep.  And I removed the
StaticAssertStmt you added, because it's not actually used for
anything that necessitates that, yet.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-13 Thread Peter Geoghegan
On Thu, Nov 13, 2014 at 9:36 AM, Robert Haas robertmh...@gmail.com wrote:
 Committed.  I changed varstr_leven() to varstr_levenshtein() because
 abbrvs cn mk the code hrd to undstnd.  And to grep.

Thanks. I'll produce a revision of patch 2/2 soon.

 And I removed the
 StaticAssertStmt you added, because it's not actually used for
 anything that necessitates that, yet.

I'll add it back in in patch 2/2, so.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-12 Thread Robert Haas
On Sun, Nov 9, 2014 at 11:48 PM, Peter Geoghegan p...@heroku.com wrote:
 On Fri, Nov 7, 2014 at 12:57 PM, Robert Haas robertmh...@gmail.com wrote:
 Based on this review from a month ago, I'm going to mark this Waiting
 on Author.  If nobody updates the patch in a few days, I'll mark it
 Returned with Feedback.  Thanks.

 Attached revision fixes the compiler warning that Heikki complained
 about. I maintain SQL-callable stub functions from within contrib,
 rather than follow Michael's approach. In other words, very little has
 changed from my revision from July last [1].

I agree with your proposed approach to moving Levenshtein into core.
However, I think this should be separated into two patches, one of
them moving the Levenshtein functionality into core, and the other
adding the new treatment for missing column errors.  If you can do
that relatively soon, I'll make an effort to get the refactoring patch
committed in the near future.  Once that's done, we can focus in on
the interesting part of the patch, which is the actual machinery for
suggesting alternatives.

On that topic, I think there's unanimous consensus against the design
where equally-distant matches are treated differently based on whether
they are in the same RTE or different RTEs.  I think you need to
change that if you want to get anywhere with this.  On a related note,
the use of the additional parameter AttrNumber closest[2] to
searchRangeTableForCol() and of the additional parameters AttrNumber
*matchedatt and int *distance to scanRTEForColumn() is less than
self-documenting.  I suggest creating a structure called something
like FuzzyAttrMatchState and passing a pointer to it down to both
functions.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-12 Thread Peter Geoghegan
On Wed, Nov 12, 2014 at 12:59 PM, Robert Haas robertmh...@gmail.com wrote:
 I agree with your proposed approach to moving Levenshtein into core.
 However, I think this should be separated into two patches, one of
 them moving the Levenshtein functionality into core, and the other
 adding the new treatment for missing column errors.  If you can do
 that relatively soon, I'll make an effort to get the refactoring patch
 committed in the near future.  Once that's done, we can focus in on
 the interesting part of the patch, which is the actual machinery for
 suggesting alternatives.

Okay, thanks. I think I can do that fairly soon.

 On that topic, I think there's unanimous consensus against the design
 where equally-distant matches are treated differently based on whether
 they are in the same RTE or different RTEs.  I think you need to
 change that if you want to get anywhere with this.

Alright. It wasn't as if I felt very strongly about it either way.

 On a related note,
 the use of the additional parameter AttrNumber closest[2] to
 searchRangeTableForCol() and of the additional parameters AttrNumber
 *matchedatt and int *distance to scanRTEForColumn() is less than
 self-documenting.  I suggest creating a structure called something
 like FuzzyAttrMatchState and passing a pointer to it down to both
 functions.

Sure.
-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-12 Thread Peter Geoghegan
On Wed, Nov 12, 2014 at 1:13 PM, Peter Geoghegan p...@heroku.com wrote:
 On Wed, Nov 12, 2014 at 12:59 PM, Robert Haas robertmh...@gmail.com wrote:
 I agree with your proposed approach to moving Levenshtein into core.
 However, I think this should be separated into two patches, one of
 them moving the Levenshtein functionality into core, and the other
 adding the new treatment for missing column errors.  If you can do
 that relatively soon, I'll make an effort to get the refactoring patch
 committed in the near future.  Once that's done, we can focus in on
 the interesting part of the patch, which is the actual machinery for
 suggesting alternatives.

 Okay, thanks. I think I can do that fairly soon.

Attached patch moves the Levenshtein distance implementation into core.

You're missing patch 2 of 2 here, because I have yet to incorporate
your feedback on the HINT itself -- when I've done that, I'll post a
newly rebased patch 2/2, with those items taken care of. As you
pointed out, there is no reason to wait for that.

-- 
Peter Geoghegan


.0001-Move-Levenshtein-distance-implementation-into-core.patch.swp
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-12 Thread Peter Geoghegan
On Wed, Nov 12, 2014 at 4:54 PM, Peter Geoghegan p...@heroku.com wrote:
 Attached patch moves the Levenshtein distance implementation into core.

Oops. Somehow managed to send a *.patch.swp file.  :-)

Here is the actual patch.

-- 
Peter Geoghegan
From b7df918f1a52107637600f3b22d1cff18bd07ae1 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan p...@heroku.com
Date: Sat, 30 Nov 2013 23:15:00 -0800
Subject: [PATCH 1/2] Move Levenshtein distance implementation into core

The fuzzystmatch Levenshtein distance implementation is moved from
/contrib to core.  However, the related SQL-callable functions may still
only be used with the fuzzystmatch extension installed -- the
fuzzystmatch definitions become simple forwarding stubs.

It is not anticipated that the user-facing SQL functions will be moved
into core in the future, due to the MAX_LEVENSHTEIN_STRLEN restriction.
An in-core Levenshtein distance implementation is only anticipated to be
helpful in building diagnostic messages.
---
 contrib/fuzzystrmatch/Makefile|   3 -
 contrib/fuzzystrmatch/fuzzystrmatch.c |  81 ---
 contrib/fuzzystrmatch/levenshtein.c   | 403 --
 src/backend/utils/adt/Makefile|   2 +
 src/backend/utils/adt/levenshtein.c   | 394 +
 src/backend/utils/adt/varlena.c   |  24 ++
 src/include/utils/builtins.h  |   5 +
 7 files changed, 481 insertions(+), 431 deletions(-)
 delete mode 100644 contrib/fuzzystrmatch/levenshtein.c
 create mode 100644 src/backend/utils/adt/levenshtein.c

diff --git a/contrib/fuzzystrmatch/Makefile b/contrib/fuzzystrmatch/Makefile
index 024265d..0327d95 100644
--- a/contrib/fuzzystrmatch/Makefile
+++ b/contrib/fuzzystrmatch/Makefile
@@ -17,6 +17,3 @@ top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 include $(top_srcdir)/contrib/contrib-global.mk
 endif
-
-# levenshtein.c is #included by fuzzystrmatch.c
-fuzzystrmatch.o: fuzzystrmatch.c levenshtein.c
diff --git a/contrib/fuzzystrmatch/fuzzystrmatch.c b/contrib/fuzzystrmatch/fuzzystrmatch.c
index 7a53d8a..62e650f 100644
--- a/contrib/fuzzystrmatch/fuzzystrmatch.c
+++ b/contrib/fuzzystrmatch/fuzzystrmatch.c
@@ -154,23 +154,6 @@ getcode(char c)
 /* These prevent GH from becoming F */
 #define NOGHTOF(c)	(getcode(c)  16)	/* BDH */
 
-/* Faster than memcmp(), for this use case. */
-static inline bool
-rest_of_char_same(const char *s1, const char *s2, int len)
-{
-	while (len  0)
-	{
-		len--;
-		if (s1[len] != s2[len])
-			return false;
-	}
-	return true;
-}
-
-#include levenshtein.c
-#define LEVENSHTEIN_LESS_EQUAL
-#include levenshtein.c
-
 PG_FUNCTION_INFO_V1(levenshtein_with_costs);
 Datum
 levenshtein_with_costs(PG_FUNCTION_ARGS)
@@ -180,8 +163,20 @@ levenshtein_with_costs(PG_FUNCTION_ARGS)
 	int			ins_c = PG_GETARG_INT32(2);
 	int			del_c = PG_GETARG_INT32(3);
 	int			sub_c = PG_GETARG_INT32(4);
-
-	PG_RETURN_INT32(levenshtein_internal(src, dst, ins_c, del_c, sub_c));
+	const char *s_data;
+	const char *t_data;
+	int			s_bytes,
+t_bytes;
+
+	/* Extract a pointer to the actual character data */
+	s_data = VARDATA_ANY(src);
+	t_data = VARDATA_ANY(dst);
+	/* Determine length of each string in bytes and characters */
+	s_bytes = VARSIZE_ANY_EXHDR(src);
+	t_bytes = VARSIZE_ANY_EXHDR(dst);
+
+	PG_RETURN_INT32(varstr_leven(s_data, s_bytes, t_data, t_bytes, ins_c,
+ del_c, sub_c));
 }
 
 
@@ -191,8 +186,20 @@ levenshtein(PG_FUNCTION_ARGS)
 {
 	text	   *src = PG_GETARG_TEXT_PP(0);
 	text	   *dst = PG_GETARG_TEXT_PP(1);
-
-	PG_RETURN_INT32(levenshtein_internal(src, dst, 1, 1, 1));
+	const char *s_data;
+	const char *t_data;
+	int			s_bytes,
+t_bytes;
+
+	/* Extract a pointer to the actual character data */
+	s_data = VARDATA_ANY(src);
+	t_data = VARDATA_ANY(dst);
+	/* Determine length of each string in bytes and characters */
+	s_bytes = VARSIZE_ANY_EXHDR(src);
+	t_bytes = VARSIZE_ANY_EXHDR(dst);
+
+	PG_RETURN_INT32(varstr_leven(s_data, s_bytes, t_data, t_bytes, 1, 1,
+	   1));
 }
 
 
@@ -206,8 +213,20 @@ levenshtein_less_equal_with_costs(PG_FUNCTION_ARGS)
 	int			del_c = PG_GETARG_INT32(3);
 	int			sub_c = PG_GETARG_INT32(4);
 	int			max_d = PG_GETARG_INT32(5);
-
-	PG_RETURN_INT32(levenshtein_less_equal_internal(src, dst, ins_c, del_c, sub_c, max_d));
+	const char *s_data;
+	const char *t_data;
+	int			s_bytes,
+t_bytes;
+
+	/* Extract a pointer to the actual character data */
+	s_data = VARDATA_ANY(src);
+	t_data = VARDATA_ANY(dst);
+	/* Determine length of each string in bytes and characters */
+	s_bytes = VARSIZE_ANY_EXHDR(src);
+	t_bytes = VARSIZE_ANY_EXHDR(dst);
+
+	PG_RETURN_INT32(varstr_leven_less_equal(s_data, s_bytes, t_data, t_bytes,
+			ins_c, del_c, sub_c, max_d));
 }
 
 
@@ -218,8 +237,20 @@ levenshtein_less_equal(PG_FUNCTION_ARGS)
 	text	   *src = PG_GETARG_TEXT_PP(0);
 	text	   *dst = PG_GETARG_TEXT_PP(1);
 	int			max_d = PG_GETARG_INT32(2);
-
-	PG_RETURN_INT32(levenshtein_less_equal_internal(src, dst, 1, 1, 1, max_d));

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-10 Thread Michael Paquier
On Mon, Nov 10, 2014 at 1:48 PM, Peter Geoghegan p...@heroku.com wrote:

 Reminder: I maintain a slight preference for only offering one
 suggestion per relation RTE, which is what this revision does (so no
 change there). If a committer who picks this up wants me to alter
 that, I don't mind doing so; since only Michael spoke up on this, I've
 kept things my way.

Hm. The last version of this patch has not really changed since since my
first review, and I have no more feedback to provide about it except what I
already mentioned. I honestly don't think that this patch is ready for
committer as-is... If someone wants to review it further, well extra
opinions I am sure are welcome.
-- 
Michael


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-10 Thread Peter Geoghegan
On Mon, Nov 10, 2014 at 8:13 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 Hm. The last version of this patch has not really changed since since my
 first review, and I have no more feedback to provide about it except what I
 already mentioned. I honestly don't think that this patch is ready for
 committer as-is... If someone wants to review it further, well extra
 opinions I am sure are welcome.

Why not?

You've already said that you're happy to defer to whatever committer
picks this up with regard to whether or not more than a single
suggestion can come from an RTE. I agreed with this (i.e. I said I'd
defer to their opinion too), and once again drew particular attention
to this state of affairs alongside my most recent revision.

What does that leave?


-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-10 Thread Peter Geoghegan
On Mon, Nov 10, 2014 at 10:29 PM, Peter Geoghegan p...@heroku.com wrote:
 Why not?

 You've already said that you're happy to defer to whatever committer
 picks this up with regard to whether or not more than a single
 suggestion can come from an RTE. I agreed with this (i.e. I said I'd
 defer to their opinion too), and once again drew particular attention
 to this state of affairs alongside my most recent revision.

 What does that leave?

I see you've marked this Needs Review, even though your previously
marked it Ready for Committer a few months back (Robert marked it
Waiting on Author very recently because of the compiler warning, and
then I marked it back to Ready for Committer once that was
addressed, before you finally marked it back to Needs Review and
removed yourself as the reviewer just now).

I'm pretty puzzled by this. Other than our agree to disagree and
defer to committer position on the question of whether or not more
than one suggestion can come from a single RTE, which you were fine
with before [1], I have only restored the core/contrib separation to a
state recently suggested by Robert as the best and simplest all around
[2].

Did I miss something else?

[1] 
http://www.postgresql.org/message-id/CAB7nPqQObEeQ298F0Rb5+vrgex5_r=j-bvqzgp0qa1y_xdc...@mail.gmail.com
[2] 
http://www.postgresql.org/message-id/ca+tgmoykiiq8mc0uj5i5xfktybg1qqfn4yrckz60ydunumk...@mail.gmail.com
-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-09 Thread Peter Geoghegan
On Fri, Nov 7, 2014 at 12:57 PM, Robert Haas robertmh...@gmail.com wrote:
 Based on this review from a month ago, I'm going to mark this Waiting
 on Author.  If nobody updates the patch in a few days, I'll mark it
 Returned with Feedback.  Thanks.

Attached revision fixes the compiler warning that Heikki complained
about. I maintain SQL-callable stub functions from within contrib,
rather than follow Michael's approach. In other words, very little has
changed from my revision from July last [1].

Reminder: I maintain a slight preference for only offering one
suggestion per relation RTE, which is what this revision does (so no
change there). If a committer who picks this up wants me to alter
that, I don't mind doing so; since only Michael spoke up on this, I've
kept things my way.

This is not a completion mechanism; it is supposed to work on
*complete* column references with slight misspellings (e.g. incorrect
use of plurals, or column references with an omitted underscore
character). Weighing Tom's concerns about suggestions that are of
absolute low quality is what makes me conclude that this is the thing
to do.

[1] 
http://www.postgresql.org/message-id/CAM3SWZTzQO=oy4jmfb-65iefie8ihukderk-0oljetm8dsr...@mail.gmail.com
-- 
Peter Geoghegan
From 830bf9f668972ba6b531df5d4fcbd73db3472434 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan p...@heroku.com
Date: Sat, 30 Nov 2013 23:15:00 -0800
Subject: [PATCH] Levenshtein distance column HINT

Add a new HINT -- a guess as to what column the user might have intended
to reference, to be shown in various contexts where an
ERRCODE_UNDEFINED_COLUMN error is raised.  The user will see this HINT
when he or she fat-fingers a column reference in his or her ad-hoc SQL
query, or incorrectly pluralizes or fails to pluralize a column
reference.

The HINT suggests a column in the range table with the lowest
Levenshtein distance, or the tied-for-best pair of matching columns in
the event of there being exactly two equally likely candidates (iff each
candidate column comes from a separate RTE).  Limiting the cases where
multiple equally likely suggestions are all offered at once is a measure
against suggestions that are of low quality in an absolute sense.

A further, final measure is taken against suggestions that are of low
absolute quality:  If the distance exceeds a normalized distance
threshold, no suggestion is given.

The contrib Levenshtein distance implementation is moved from /contrib
to core.  However, the SQL-callable functions may only be used with the
fuzzystmatch extension installed, just as before -- the fuzzystmatch
definitions become mere forwarding stubs.
---
 contrib/fuzzystrmatch/Makefile|   3 -
 contrib/fuzzystrmatch/fuzzystrmatch.c |  81 --
 contrib/fuzzystrmatch/levenshtein.c   | 403 --
 src/backend/parser/parse_expr.c   |   9 +-
 src/backend/parser/parse_func.c   |   2 +-
 src/backend/parser/parse_relation.c   | 319 ---
 src/backend/utils/adt/Makefile|   2 +
 src/backend/utils/adt/levenshtein.c   | 393 +
 src/backend/utils/adt/varlena.c   |  25 ++
 src/include/parser/parse_relation.h   |   3 +-
 src/include/utils/builtins.h  |   5 +
 src/test/regress/expected/alter_table.out |   8 +
 src/test/regress/expected/join.out|  39 +++
 src/test/regress/expected/plpgsql.out |   1 +
 src/test/regress/expected/rowtypes.out|   1 +
 src/test/regress/expected/rules.out   |   1 +
 src/test/regress/expected/without_oid.out |   1 +
 src/test/regress/sql/join.sql |  24 ++
 18 files changed, 849 insertions(+), 471 deletions(-)
 delete mode 100644 contrib/fuzzystrmatch/levenshtein.c
 create mode 100644 src/backend/utils/adt/levenshtein.c

diff --git a/contrib/fuzzystrmatch/Makefile b/contrib/fuzzystrmatch/Makefile
index 024265d..0327d95 100644
--- a/contrib/fuzzystrmatch/Makefile
+++ b/contrib/fuzzystrmatch/Makefile
@@ -17,6 +17,3 @@ top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 include $(top_srcdir)/contrib/contrib-global.mk
 endif
-
-# levenshtein.c is #included by fuzzystrmatch.c
-fuzzystrmatch.o: fuzzystrmatch.c levenshtein.c
diff --git a/contrib/fuzzystrmatch/fuzzystrmatch.c b/contrib/fuzzystrmatch/fuzzystrmatch.c
index 7a53d8a..62e650f 100644
--- a/contrib/fuzzystrmatch/fuzzystrmatch.c
+++ b/contrib/fuzzystrmatch/fuzzystrmatch.c
@@ -154,23 +154,6 @@ getcode(char c)
 /* These prevent GH from becoming F */
 #define NOGHTOF(c)	(getcode(c)  16)	/* BDH */
 
-/* Faster than memcmp(), for this use case. */
-static inline bool
-rest_of_char_same(const char *s1, const char *s2, int len)
-{
-	while (len  0)
-	{
-		len--;
-		if (s1[len] != s2[len])
-			return false;
-	}
-	return true;
-}
-
-#include levenshtein.c
-#define LEVENSHTEIN_LESS_EQUAL
-#include levenshtein.c
-
 PG_FUNCTION_INFO_V1(levenshtein_with_costs);
 Datum
 levenshtein_with_costs(PG_FUNCTION_ARGS)
@@ 

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-09 Thread Michael Paquier
On Mon, Nov 10, 2014 at 1:48 PM, Peter Geoghegan p...@heroku.com wrote:

 On Fri, Nov 7, 2014 at 12:57 PM, Robert Haas robertmh...@gmail.com
 wrote:
  Based on this review from a month ago, I'm going to mark this Waiting
  on Author.  If nobody updates the patch in a few days, I'll mark it
  Returned with Feedback.  Thanks.

 Attached revision fixes the compiler warning that Heikki complained
 about. I maintain SQL-callable stub functions from within contrib,
 rather than follow Michael's approach. In other words, very little has
 changed from my revision from July last [1].

FWIW, I still find this bit of code that this patch adds in varlena.c ugly:
+#include levenshtein.c
+#define LEVENSHTEIN_LESS_EQUAL
+#include levenshtein.c
+#undef LEVENSHTEIN_LESS_EQUAL
-- 
Michael


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-09 Thread Peter Geoghegan
On Sun, Nov 9, 2014 at 8:56 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 FWIW, I still find this bit of code that this patch adds in varlena.c ugly:

 +#include levenshtein.c
 +#define LEVENSHTEIN_LESS_EQUAL
 +#include levenshtein.c
 +#undef LEVENSHTEIN_LESS_EQUAL

Okay, but this is the coding that currently appears within contrib's
fuzzystrmatch.c, more or less unchanged. The #undef
LEVENSHTEIN_LESS_EQUAL line that I added ought to be unnecessary.
I'll give the final word on that to whoever picks this up.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-07 Thread Robert Haas
On Mon, Oct 6, 2014 at 3:09 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 On 07/18/2014 10:47 AM, Michael Paquier wrote:

 On Fri, Jul 18, 2014 at 3:54 AM, Peter Geoghegan p...@heroku.com wrote:

 I am not opposed to moving the contrib code into core in the manner
 that you oppose. I don't feel strongly either way.

 I noticed in passing that your revision says this *within* levenshtein.c:

 + * Guaranteed to work with Name datatype's cstrings.
 + * For full details see levenshtein.c.

 Yeah, I looked at what I produced yesterday night again and came
 across a couple of similar things :) And reworked a couple of things
 in the version attached, mainly wordsmithing and adding comments here
 and there, as well as making the naming of the Levenshtein functions
 in core the same as the ones in fuzzystrmatch 1.0.

 I imagined that when a committer picked this up, an executive decision
 would be made one way or the other. I am quite willing to revise the
 patch to alter this behavior at the request of a committer.

 Fine for me. I'll move this patch to the next stage then.


 There are a bunch of compiler warnings:

 parse_relation.c: In function ‘errorMissingColumn’:
 parse_relation.c:3114:447: warning: ‘closestcol1’ may be used uninitialized
 in this function [-Wmaybe-uninitialized]
 parse_relation.c:3066:8: note: ‘closestcol1’ was declared here
 parse_relation.c:3129:29: warning: ‘closestcol2’ may be used uninitialized
 in this function [-Wmaybe-uninitialized]
 parse_relation.c:3067:8: note: ‘closestcol2’ was declared here
 levenshtein.c: In function ‘levenshtein_common’:
 levenshtein.c:107:6: warning: unused variable ‘start_column_local’
 [-Wunused-variable]

Based on this review from a month ago, I'm going to mark this Waiting
on Author.  If nobody updates the patch in a few days, I'll mark it
Returned with Feedback.  Thanks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-10-06 Thread Heikki Linnakangas

On 07/18/2014 10:47 AM, Michael Paquier wrote:

On Fri, Jul 18, 2014 at 3:54 AM, Peter Geoghegan p...@heroku.com wrote:

I am not opposed to moving the contrib code into core in the manner
that you oppose. I don't feel strongly either way.

I noticed in passing that your revision says this *within* levenshtein.c:

+ * Guaranteed to work with Name datatype's cstrings.
+ * For full details see levenshtein.c.

Yeah, I looked at what I produced yesterday night again and came
across a couple of similar things :) And reworked a couple of things
in the version attached, mainly wordsmithing and adding comments here
and there, as well as making the naming of the Levenshtein functions
in core the same as the ones in fuzzystrmatch 1.0.


I imagined that when a committer picked this up, an executive decision
would be made one way or the other. I am quite willing to revise the
patch to alter this behavior at the request of a committer.

Fine for me. I'll move this patch to the next stage then.


There are a bunch of compiler warnings:

parse_relation.c: In function ‘errorMissingColumn’:
parse_relation.c:3114:447: warning: ‘closestcol1’ may be used 
uninitialized in this function [-Wmaybe-uninitialized]

parse_relation.c:3066:8: note: ‘closestcol1’ was declared here
parse_relation.c:3129:29: warning: ‘closestcol2’ may be used 
uninitialized in this function [-Wmaybe-uninitialized]

parse_relation.c:3067:8: note: ‘closestcol2’ was declared here
levenshtein.c: In function ‘levenshtein_common’:
levenshtein.c:107:6: warning: unused variable ‘start_column_local’ 
[-Wunused-variable]


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-07-23 Thread Robert Haas
On Thu, Jul 17, 2014 at 9:34 AM, Michael Paquier
michael.paqu...@gmail.com wrote:
 Patch 1 does a couple of things:
 - fuzzystrmatch is dumped to 1.1, as Levenshtein functions are not part of
 it anymore, and moved to core.
 - Removal of the LESS_EQUAL flag that made the original submission patch
 harder to understand. All the Levenshtein functions wrap a single common
 function.
 - Documentation is moved, and regression tests for Levenshtein functions are
 added.
 - Functions with costs are renamed with a suffix with costs.
 After hacking this feature, I came up with the conclusion that it would be
 better for the user experience to move directly into backend code all the
 Levenshtein functions, instead of only moving in the common wrapper as Peter
 did in his original patches. This is done this way to avoid keeping portions
 of the same feature in two different places of the code (backend with common
 routine, fuzzystrmatch with levenshtein functions) and concentrate all the
 logic in a single place. Now, we may as well consider renaming the
 levenshtein functions into smarter names, like str_distance, and keep
 fuzzystrmatch to 1.0, having the functions levenshteing_* calling only the
 str_distance functions.

This is not cool.  Anyone who is running a 9.4 or earlier database
using fuzzystrmatch and upgrades, either via dump-and-restore or
pg_upgrade, to a version with this patch applied will have a broken
database.  They will still have the catalog entries for the 1.0
definitions, but those definitions won't be resolvable inside the new
cluster's .so file.  The user will get a fairly-unfriendly error
message that won't go away until they upgrade the extension, which may
involve dealing with dependency hell since the new definitions are in
a different place than the old definitions, and there may be
dependencies on the old definitions.  One of the great advantages of
extension packaging is that this kind of problem is quite easily
avoidable, so let's avoid it.

There are several possible methods of doing that, but I think the best
one is just to leave the SQL-callable C functions in fuzzystrmatch and
move only the underlying code that supports into core.  Then, the
whole thing will be completely transparent to users.  They won't need
to upgrade their fuzzystrmatch definitions at all, and everything will
just work; under the covers, the fuzzystrmatch code will now be
calling into core code rather than to code located in that same
module, but the user doesn't need to know or care about that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-07-23 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 There are several possible methods of doing that, but I think the best
 one is just to leave the SQL-callable C functions in fuzzystrmatch and
 move only the underlying code that supports into core.

I hadn't been paying close attention to this thread, but I'd just assumed
that that would be the approach.

It might be worth introducing new differently-named pg_proc entries for
the same functions in core, but only if we can agree that there are better
names for them than what the extension uses.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-07-23 Thread Peter Geoghegan
On Wed, Jul 23, 2014 at 8:57 AM, Robert Haas robertmh...@gmail.com wrote:
 There are several possible methods of doing that, but I think the best
 one is just to leave the SQL-callable C functions in fuzzystrmatch and
 move only the underlying code that supports into core.

For some reason I thought that that was what Michael was proposing - a
more comprehensive move of code into core than the structuring that I
proposed. I actually thought about a Levenshtein distance operator at
one point months ago, before I entirely gave up on that. The
MAX_LEVENSHTEIN_STRLEN limitation made me think that the Levenshtein
distance functions are not suitable for core as is (although that
doesn't matter for my purposes, since all I need is something that
accommodates NAMEDATALEN sized strings). MAX_LEVENSHTEIN_STRLEN is a
considerable limitation for an in-core feature. I didn't get around to
forming an opinion on how and if that should be fixed.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-07-23 Thread Alvaro Herrera
Peter Geoghegan wrote:

 For some reason I thought that that was what Michael was proposing - a
 more comprehensive move of code into core than the structuring that I
 proposed. I actually thought about a Levenshtein distance operator at
 one point months ago, before I entirely gave up on that. The
 MAX_LEVENSHTEIN_STRLEN limitation made me think that the Levenshtein
 distance functions are not suitable for core as is (although that
 doesn't matter for my purposes, since all I need is something that
 accommodates NAMEDATALEN sized strings). MAX_LEVENSHTEIN_STRLEN is a
 considerable limitation for an in-core feature. I didn't get around to
 forming an opinion on how and if that should be fixed.

I had two thoughts:

1. Should we consider making levenshtein available to frontend programs
as well as backend?
2. Would it provide better matching to use Damerau-Levenshtein[1] instead
of raw Levenshtein?

.oO(Would anyone be so bold as to attempt to implement bitap[2] using
bitmapsets ...)

[1] http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance
[2] http://en.wikipedia.org/wiki/Bitap_algorithm

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-07-23 Thread Peter Geoghegan
On Wed, Jul 23, 2014 at 1:10 PM, Alvaro Herrera
alvhe...@2ndquadrant.com wrote:
 I had two thoughts:

 1. Should we consider making levenshtein available to frontend programs
 as well as backend?

I don't think so. Why would that be useful?

 2. Would it provide better matching to use Damerau-Levenshtein[1] instead
 of raw Levenshtein?

Maybe that would be marginally better than classic Levenshtein
distance, but I doubt it would pay for itself. It's just more code to
maintain. Are we really expecting to not get the best possible
suggestion due to some number of transposition errors very frequently?
You still have to have a worse suggestion spuriously get ahead of
yours, and typically there just aren't that many to begin with. I'm
not targeting spelling errors so much as thinkos around plurals and
whether or not an underscore was used. Damerau-Levenshtein seems like
an algorithm with fairly specialized applications.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-07-23 Thread Michael Paquier
On Thu, Jul 24, 2014 at 1:09 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 Robert Haas robertmh...@gmail.com writes:
  There are several possible methods of doing that, but I think the best
  one is just to leave the SQL-callable C functions in fuzzystrmatch and
  move only the underlying code that supports into core.

 I hadn't been paying close attention to this thread, but I'd just assumed
 that that would be the approach.

 It might be worth introducing new differently-named pg_proc entries for
 the same functions in core, but only if we can agree that there are better
 names for them than what the extension uses.

Yes, that's a point I raised upthread as well. What about renaming those
functions as string_distance and string_distance_less_than? Then have only
fuzzystrmatch do some DirectFunctionCall using the in-core functions?
-- 
Michael


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-07-23 Thread Alvaro Herrera
Peter Geoghegan wrote:

 Maybe that would be marginally better than classic Levenshtein
 distance, but I doubt it would pay for itself. It's just more code to
 maintain. Are we really expecting to not get the best possible
 suggestion due to some number of transposition errors very frequently?
 You still have to have a worse suggestion spuriously get ahead of
 yours, and typically there just aren't that many to begin with. I'm
 not targeting spelling errors so much as thinkos around plurals and
 whether or not an underscore was used. Damerau-Levenshtein seems like
 an algorithm with fairly specialized applications.

Yes, it's for typos.  I guess it's an unfrequent scenario to have both a
typoed column and a column that's missing the plural declension, which
is the case in which Damerau-Lvsh would be a win.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-07-17 Thread Peter Geoghegan
I am not opposed to moving the contrib code into core in the manner
that you oppose. I don't feel strongly either way.

I noticed in passing that your revision says this *within* levenshtein.c:

+ * Guaranteed to work with Name datatype's cstrings.
+ * For full details see levenshtein.c.


On Thu, Jul 17, 2014 at 6:34 AM, Michael Paquier
michael.paqu...@gmail.com wrote:
 Patch 2 is a rebase of the feature of Peter that can be applied on top of
 patch 1. The code is rather untouched (haven't much played with Peter's
 thingies), well-commented, but I think that this needs more work,
 particularly when a query has a single RTE like in this case where no hints
 are proposed to the user (mentioned upthread):

The only source of disagreement that I am aware of at this point is
the question of whether or not we should accept two candidates from
the same RTE. I lean slightly towards no, as already explained [1]
[2], but it's not as if I feel that strongly either way - this
approach of looking for only a single best candidate per RTE taken in
deference to the concerns of others.

I imagined that when a committer picked this up, an executive decision
would be made one way or the other. I am quite willing to revise the
patch to alter this behavior at the request of a committer.

[1] 
http://www.postgresql.org/message-id/CAM3SWZTrm4PmqMmL9=eyx-8f-vx-ha7dme4koms2vcomozg...@mail.gmail.com
[2] 
http://www.postgresql.org/message-id/cam3swzs6kiqeqjz4pv3fkp6cgw1ws26exoqtjb_xmw3ze5b...@mail.gmail.com
-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-07-09 Thread Michael Paquier
On Wed, Jul 9, 2014 at 5:56 AM, Peter Geoghegan p...@heroku.com wrote:

 On Tue, Jul 8, 2014 at 1:55 PM, Peter Geoghegan p...@heroku.com wrote:
  I was worried about the common case where a
  column name is misspelled that would otherwise be ambiguous, which is
  why that shows a HINT while the single RTE case doesn't

 To be clear - I mean a HINT with two suggestions rather than just one.
 If there are 3 or more equally distant suggestions (even if they're
 all from different RTEs) we also give no HINT in the proposed patch.

Showing up to 2 hints is fine as it does not pollute the error output with
perhaps unnecessary messages. That's even more protective than for example
git that prints all the equidistant candidates. However I can't understand
why it does not show up hints even if there are two equidistant candidates
from the same RTE. I think it should.
-- 
Michael


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-07-09 Thread Michael Paquier
On Wed, Jul 9, 2014 at 1:49 AM, Peter Geoghegan p...@heroku.com wrote:

  4) This is not nice, could it be possible to remove the stuff from
 varlena.c?
  +/* Expand each Levenshtein distance variant */
  +#include levenshtein.c
  +#define LEVENSHTEIN_LESS_EQUAL
  +#include levenshtein.c
  +#undef LEVENSHTEIN_LESS_EQUAL
  Part of the same comment: only varstr_leven_less_equal is used to
  calculate the distance, should we really move varstr_leven to core?
  This clearly needs to be reworked as not just a copy-paste of the
  things in fuzzystrmatch.
  The flag LEVENSHTEIN_LESS_EQUAL should be let within fuzzystrmatch I
 think.

 So there'd be one variant within core and one within
 contrib/fuzzystrmatch? I don't think that's an improvement.

No. The main difference between varstr_leven_less_equal and varstr_leven is
the use of the extra argument max_d in the former. My argument here is
instead of blindly cut-pasting into core the code you are interested in to
evaluate the string distances, is to refactor it to have a unique function,
and to let the business with LEVENSHTEIN_LESS_EQUAL within
contrib/fuzzystrmatch. This will require some reshuffling of the distance
function, but by looking at this patch I am getting the feeling that this
is necessary, and should even be split into a first patch for fuzzystrmatch
that would facilitate its integration into core.
Also why is rest_of_char_same within varlena.c?

 5) Do we want hints on system columns as well?
 I think it's obvious that the answer must be no. That's going to
 frequently result in suggestions of columns that users will complain
 aren't even there. If you know about the system columns, you can just
 get it right. They're supposed to be hidden for most purposes.

This may sound ridiculous, but I have already found myself mistyping ctid
by tid and cid while working on patches and modules that played with page
format, and needing a couple of minutes to understand what was going on
(bad morning). I would have welcomed such hints in those cases.
-- 
Michael


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-07-09 Thread Peter Geoghegan
On Tue, Jul 8, 2014 at 11:10 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 Showing up to 2 hints is fine as it does not pollute the error output with
 perhaps unnecessary messages. That's even more protective than for example
 git that prints all the equidistant candidates. However I can't understand
 why it does not show up hints even if there are two equidistant candidates
 from the same RTE. I think it should.

Everyone is going to have an opinion on something like that. I was
showing deference to the general concern about the absolute (as
opposed to relative) quality of the HINTs in the event of equidistant
matches by having no two suggestions come from within a single RTE,
while still covering the case I thought was important by having two
suggestions if there were two equidistant matches across RTEs. I think
that's marginally better then what you propose, because your case
deals with two equidistant though distinct columns, bringing into
question the validity of both would-be suggestions. I'll defer to
whatever the consensus is.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-07-09 Thread Peter Geoghegan
On Tue, Jul 8, 2014 at 11:25 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 So there'd be one variant within core and one within
 contrib/fuzzystrmatch? I don't think that's an improvement.

 No. The main difference between varstr_leven_less_equal and varstr_leven is
 the use of the extra argument max_d in the former. My argument here is
 instead of blindly cut-pasting into core the code you are interested in to
 evaluate the string distances, is to refactor it to have a unique function,
 and to let the business with LEVENSHTEIN_LESS_EQUAL within
 contrib/fuzzystrmatch. This will require some reshuffling of the distance
 function, but by looking at this patch I am getting the feeling that this is
 necessary, and should even be split into a first patch for fuzzystrmatch
 that would facilitate its integration into core.
 Also why is rest_of_char_same within varlena.c?

Just as before, rest_of_char_same() exists for the express purpose of
being called by the two variants varstr_leven_less_equal() and
varstr_leven(). Why wouldn't I copy it over too along with those two?
Where do you propose to put it?

Obviously the existing macro hacks (that I haven't changed) that build
the two variants are not terribly pretty, but they're not arbitrary
either. They reflect the fact that there is no natural way to add
callbacks or something like that. If you pretended that the core code
didn't have to care about one case or the other, and that contrib was
somehow obligated to hook in its own handler for the
!LEVENSHTEIN_LESS_EQUAL case that it now only cares about, then you'd
end up with an even bigger mess. Besides, with the patch the core code
is calling varstr_leven_less_equal(), which is the bigger of the two
variants - it's the LEVENSHTEIN_LESS_EQUAL case, not the
!LEVENSHTEIN_LESS_EQUAL case that core cares about for the purposes of
building HINTs. In short, I don't know what you mean. What would that
reshuffling actually look like?

  5) Do we want hints on system columns as well?
 I think it's obvious that the answer must be no. That's going to
 frequently result in suggestions of columns that users will complain
 aren't even there. If you know about the system columns, you can just
 get it right. They're supposed to be hidden for most purposes.

 This may sound ridiculous, but I have already found myself mistyping ctid by
 tid and cid while working on patches and modules that played with page
 format, and needing a couple of minutes to understand what was going on (bad
 morning).

I think that it's clearly not worth it, even if it is true that a
minority sometimes make this mistake. Most users don't know that there
are system columns. It's not even close to being worth it to bring
that into this.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-07-09 Thread Alvaro Herrera
Peter Geoghegan wrote:
 On Tue, Jul 8, 2014 at 11:25 PM, Michael Paquier
 michael.paqu...@gmail.com wrote:

   5) Do we want hints on system columns as well?
  I think it's obvious that the answer must be no. That's going to
  frequently result in suggestions of columns that users will complain
  aren't even there. If you know about the system columns, you can just
  get it right. They're supposed to be hidden for most purposes.
 
  This may sound ridiculous, but I have already found myself mistyping ctid by
  tid and cid while working on patches and modules that played with page
  format, and needing a couple of minutes to understand what was going on (bad
  morning).
 
 I think that it's clearly not worth it, even if it is true that a
 minority sometimes make this mistake. Most users don't know that there
 are system columns. It's not even close to being worth it to bring
 that into this.

I agree with Peter.  This is targeted at regular users.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-07-09 Thread Robert Haas
On Wed, Jul 9, 2014 at 2:10 AM, Michael Paquier
michael.paqu...@gmail.com wrote:
 On Wed, Jul 9, 2014 at 5:56 AM, Peter Geoghegan p...@heroku.com wrote:
 On Tue, Jul 8, 2014 at 1:55 PM, Peter Geoghegan p...@heroku.com wrote:
  I was worried about the common case where a
  column name is misspelled that would otherwise be ambiguous, which is
  why that shows a HINT while the single RTE case doesn't

 To be clear - I mean a HINT with two suggestions rather than just one.
 If there are 3 or more equally distant suggestions (even if they're
 all from different RTEs) we also give no HINT in the proposed patch.

 Showing up to 2 hints is fine as it does not pollute the error output with
 perhaps unnecessary messages. That's even more protective than for example
 git that prints all the equidistant candidates. However I can't understand
 why it does not show up hints even if there are two equidistant candidates
 from the same RTE. I think it should.

Me, too.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-07-09 Thread Greg Stark
On Wed, Jul 9, 2014 at 7:29 AM, Peter Geoghegan p...@heroku.com wrote:
 Everyone is going to have an opinion on something like that. I was
 showing deference to the general concern about the absolute (as
 opposed to relative) quality of the HINTs in the event of equidistant
 matches by having no two suggestions come from within a single RTE,
 while still covering the case I thought was important by having two
 suggestions if there were two equidistant matches across RTEs. I think
 that's marginally better then what you propose, because your case
 deals with two equidistant though distinct columns, bringing into
 question the validity of both would-be suggestions. I'll defer to
 whatever the consensus is.

I agree this is bike shedding. But as long as we're bike shedding...

A simple rule is easier for users to understand as well as to code. I
would humbly suggest the following: take all the unqualified column
names, downcase them, check which ones match most closely the
unmatched column. Show the top 3 matches if they're within some
arbitrary distance. If they match exactly except for the case and the
unmatched column is all lower case add a comment that quoting is
required due to the mixed case.

Honestly the current logic and the previous logic both seemed
reasonable to me. They're not going to be perfect in every case so
anything that comes up some some suggestions is fine.

-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-07-09 Thread Peter Geoghegan
On Wed, Jul 9, 2014 at 8:08 AM, Greg Stark st...@mit.edu wrote:
 A simple rule is easier for users to understand as well as to code. I
 would humbly suggest the following: take all the unqualified column
 names, downcase them, check which ones match most closely the
 unmatched column. Show the top 3 matches if they're within some
 arbitrary distance.

That's harder than it sounds. You need even more translatable strings
for variant ereports(). I don't think that an easy to understand rule
is necessarily of much value - I'm already charging half price for
deletion because I found representative errors more useful in certain
cases by doing so. I think we want something that displays the most
useful suggestion as often as is practically possible, and does not
display unhelpful suggestions to the extent that it's practical to
avoid them. Plus, as I mentioned, I'm keen to avoid adding more stuff
to scanRTEForColumn() than I already have.

 Honestly the current logic and the previous logic both seemed
 reasonable to me. They're not going to be perfect in every case so
 anything that comes up some some suggestions is fine.

I think that the most recent revision is somewhat better due to the
feedback of Tom and Robert. I didn't feel as strongly as they did
about erring on the side of not showing a HINT, but I think the most
recent revision is a good compromise. But yes, at this point we're
certainly chasing diminishing returns. There are almost any number of
variants of this basic idea that could be suggested.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-07-09 Thread Peter Geoghegan
On Wed, Jul 9, 2014 at 8:06 AM, Robert Haas robertmh...@gmail.com wrote:
 Showing up to 2 hints is fine as it does not pollute the error output with
 perhaps unnecessary messages. That's even more protective than for example
 git that prints all the equidistant candidates. However I can't understand
 why it does not show up hints even if there are two equidistant candidates
 from the same RTE. I think it should.

 Me, too.

The idea is that each RTE gets one best suggestion, because if there
are two best suggestions within an RTE they're probably both wrong.
Whereas across RTEs, it's probably just that there is a foreign key
relationship between the two (and the user accidentally failed to
qualify the particular column of interest on top of the misspelling, a
qualification that would be sufficient to have the code prefer the
qualified-but-misspelled column). Clearly if I was to do what you
suggest it would be closer to a wild guess, and Tom has expressed
concerns about that.

Now, I don't actually ensure that the column names of the two columns
(each from separate RTEs) are identical save for their would-be alias,
but that's just a consequence of the implementation. Also, as I've
mentioned, I don't want to put more stuff in scanRTEForColumn() than I
already have, due to your earlier concern about adding clutter.

I think we're splitting hairs at this point, and frankly I'll do it
that way if it gets the patch closer to being committed. While I
thought it was important to get the unqualified and misspelled case
right (which I did in the first revision, but perhaps at the expense
of Tom's concern about absolute suggestion quality), I don't feel
strongly about this detail either way.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-07-09 Thread Alvaro Herrera
Peter Geoghegan wrote:
 On Wed, Jul 9, 2014 at 8:08 AM, Greg Stark st...@mit.edu wrote:
  A simple rule is easier for users to understand as well as to code. I
  would humbly suggest the following: take all the unqualified column
  names, downcase them, check which ones match most closely the
  unmatched column. Show the top 3 matches if they're within some
  arbitrary distance.
 
 That's harder than it sounds. You need even more translatable strings
 for variant ereports().

Maybe it is possible to rephrase the message so that the translatable
part doesn't need to concern with how many suggestions there are.  For
instance something like perhaps you meant a name from the following
list: foo, bar, baz.  Couple with the errmsg_plural stuff, you then
don't need to worry too much about providing different strings for 1, 2,
N suggestions.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-07-09 Thread Peter Geoghegan
On Wed, Jul 9, 2014 at 2:19 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote:
 That's harder than it sounds. You need even more translatable strings
 for variant ereports().

 Maybe it is possible to rephrase the message so that the translatable
 part doesn't need to concern with how many suggestions there are.  For
 instance something like perhaps you meant a name from the following
 list: foo, bar, baz.  Couple with the errmsg_plural stuff, you then
 don't need to worry too much about providing different strings for 1, 2,
 N suggestions.

That's not really the problem. I already have a lot of things to test
in each of the two ereport() calls. More importantly, showing the
closet, say, 3 matches under an arbitrary distance does not weigh
concerns about that indicating that they're all bad. It's not like
bash tab completion - if there is one best match, that's probably
because that's what the user meant. Whereas if there are two or more
within a single RTE, that's probably because both are unhelpful. They
both happened to require the same number of substitutions to get to,
while not being quite bad enough matches to be excluded by the final
check against a normalized distance threshold (the final check that
prevents ludicrous suggestions).

The fact that there were multiple equally plausible candidates (that
are not identically named and just from different RTEs) tells us
plenty, unlike with tab completion. It's not hard for one column to be
a better match than another, and so it doesn't seem unreasonable to
insist upon that within a single RTE where they cannot be identical,
since a conservative approach seems to be what is generally favored.
In any case I'm just trying to weigh everyone's concerns here. I hope
it's actually possible to compromise, but right now I don't know what
I can do to make useful progress.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-07-09 Thread Peter Geoghegan
On Tue, Jul 8, 2014 at 6:58 AM, Michael Paquier
michael.paqu...@gmail.com wrote:
 6) Sometimes no hints are returned... Even in simple cases like this one:
 =# create table foo (aa int, bb int);
 CREATE TABLE
 =# select ab from foo;
 ERROR:  42703: column ab does not exist
 LINE 1: select ab from foo;
^
 LOCATION:  errorMissingColumn, parse_relation.c:3123

In this example, it seems obvious that both aa and bb should be
suggested when they are not. But what if there were far more columns,
as might be expected in realistic cases (suppose all other columns
have at least 3 characters)? That's another kettle of fish. The
assumption that it's probably one of those two equally distant columns
is now on very shaky ground. After all, the user can only have meant
one particular column. If we apply a limited kind of Turing test to
this second case, how does the most recent revision's algorithm do?
What would a human suggest? I'm pretty sure the answer is that the
human would shrug. Maybe he or she would say I guess you might have
meant one of either aa or bb, but that really isn't obvious at all.
That doesn't inspire much confidence.

Now, maybe I should be more optimistic about it being one of the two
because there are only two possibilities to begin with. That seems
pretty dubious, though. In general I find it much more plausible based
on what we know that the user should rethink everything. And, as Tom
pointed out, showing nothing conveys something in itself once users
have been trained to expect something.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-07-08 Thread Michael Paquier
On Sat, Jul 5, 2014 at 12:46 AM, Abhijit Menon-Sen a...@2ndquadrant.com wrote:
 At 2014-07-02 15:51:08 -0700, p...@heroku.com wrote:

 Attached revision factors in everyone's concerns here, I think.

 Is anyone planning to review Peter's revised patch?

I have been doing some functional tests, and looked quickly at the
code to understand what it does:
1) Compiles without warnings, passes regression tests
2) Checking process goes through all the existing columns of a
relation even a difference of 1 with some other column(s) has already
been found. As we try to limit the number of hints returned, this
seems like a waste of resources.
3) distanceName could be improved, by for example having some checks
on the string lengths of target and source columns, and immediately
reject the match if for example the length of the source string is the
double/half of the length of target.
4) This is not nice, could it be possible to remove the stuff from varlena.c?
+/* Expand each Levenshtein distance variant */
+#include levenshtein.c
+#define LEVENSHTEIN_LESS_EQUAL
+#include levenshtein.c
+#undef LEVENSHTEIN_LESS_EQUAL
Part of the same comment: only varstr_leven_less_equal is used to
calculate the distance, should we really move varstr_leven to core?
This clearly needs to be reworked as not just a copy-paste of the
things in fuzzystrmatch.
The flag LEVENSHTEIN_LESS_EQUAL should be let within fuzzystrmatch I think.
5) Do we want hints on system columns as well? For example here we
could get tableoid as column hint:
=# select tablepid from foo;
ERROR:  42703: column tablepid does not exist
LINE 1: select tablepid from foo;
   ^
LOCATION:  errorMissingColumn, parse_relation.c:3123
Time: 0.425 ms
6) Sometimes no hints are returned... Even in simple cases like this one:
=# create table foo (aa int, bb int);
CREATE TABLE
=# select ab from foo;
ERROR:  42703: column ab does not exist
LINE 1: select ab from foo;
   ^
LOCATION:  errorMissingColumn, parse_relation.c:3123
7) Performance penalty with a table with 1600 columns:
=# CREATE FUNCTION create_long_table(tabname text, columns int)
RETURNS void
LANGUAGE plpgsql
as $$
declare
  first_col bool = true;
  count int;
  query text;
begin
  query := 'CREATE TABLE ' || tabname || ' (';
  for count in 0..columns loop
query := query || 'col' || count ||  ' int';
if count  columns then
  query := query || ', ';
end if;
  end loop;
  query := query || ')';
  execute query;
end;
$$;
=# SELECT create_long_table('aa', 1599);
 create_long_table
---

(1 row)
Then tested queries like that: SELECT col888a FROM aa;
Patched version: 2.100ms~2.200ms
master branch (6048896): 0.956 ms~0.990 ms
So the performance impact seems limited.

Regards,
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-07-08 Thread Peter Geoghegan
Hi,

On Tue, Jul 8, 2014 at 6:58 AM, Michael Paquier
michael.paqu...@gmail.com wrote:
 2) Checking process goes through all the existing columns of a
 relation even a difference of 1 with some other column(s) has already
 been found. As we try to limit the number of hints returned, this
 seems like a waste of resources.

In general it's possible that an exact match will later be found
within the RTE, and exact matches don't have to pay the wrong alias
penalty, and are immediately returned. It is therefore not a waste of
resources, but even if it was that would be pretty inconsequential as
your benchmark shows.

 3) distanceName could be improved, by for example having some checks
 on the string lengths of target and source columns, and immediately
 reject the match if for example the length of the source string is the
 double/half of the length of target.

I don't think it's a good idea to tie distanceName() to the ultimate
behavior of errorMissingColumn() hinting, since there may be other
callers in the future. Besides, that isn't going to help much.

 4) This is not nice, could it be possible to remove the stuff from varlena.c?
 +/* Expand each Levenshtein distance variant */
 +#include levenshtein.c
 +#define LEVENSHTEIN_LESS_EQUAL
 +#include levenshtein.c
 +#undef LEVENSHTEIN_LESS_EQUAL
 Part of the same comment: only varstr_leven_less_equal is used to
 calculate the distance, should we really move varstr_leven to core?
 This clearly needs to be reworked as not just a copy-paste of the
 things in fuzzystrmatch.
 The flag LEVENSHTEIN_LESS_EQUAL should be let within fuzzystrmatch I think.

So there'd be one variant within core and one within
contrib/fuzzystrmatch? I don't think that's an improvement.

 5) Do we want hints on system columns as well?

I think it's obvious that the answer must be no. That's going to
frequently result in suggestions of columns that users will complain
aren't even there. If you know about the system columns, you can just
get it right. They're supposed to be hidden for most purposes.

 6) Sometimes no hints are returned... Even in simple cases like this one:
 =# create table foo (aa int, bb int);
 CREATE TABLE
 =# select ab from foo;
 ERROR:  42703: column ab does not exist
 LINE 1: select ab from foo;
^
 LOCATION:  errorMissingColumn, parse_relation.c:3123

That's because those two candidates come from a single RTE and have an
equal distance -- you'd see both suggestions if you joined two tables
with each candidate, assuming that each table being joined didn't
individually have the same issue. I think that that's probably
considered the correct behavior by most.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-07-08 Thread Alvaro Herrera
Peter Geoghegan wrote:

  6) Sometimes no hints are returned... Even in simple cases like this one:
  =# create table foo (aa int, bb int);
  CREATE TABLE
  =# select ab from foo;
  ERROR:  42703: column ab does not exist
  LINE 1: select ab from foo;
 ^
  LOCATION:  errorMissingColumn, parse_relation.c:3123
 
 That's because those two candidates come from a single RTE and have an
 equal distance -- you'd see both suggestions if you joined two tables
 with each candidate, assuming that each table being joined didn't
 individually have the same issue. I think that that's probably
 considered the correct behavior by most.

It seems pretty silly to me actually.  Was this designed by a committee?
I agree with the general principle that showing a large number of
candidates (a la bash) is a bad idea, but failing to show two of them ...

Words fail me.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-07-08 Thread Peter Geoghegan
On Tue, Jul 8, 2014 at 1:42 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote:
 That's because those two candidates come from a single RTE and have an
 equal distance -- you'd see both suggestions if you joined two tables
 with each candidate, assuming that each table being joined didn't
 individually have the same issue. I think that that's probably
 considered the correct behavior by most.

 It seems pretty silly to me actually.  Was this designed by a committee?
 I agree with the general principle that showing a large number of
 candidates (a la bash) is a bad idea, but failing to show two of them ...

I guess it was designed by a committee. But we don't fail to show both
because they're equally distant. Rather, it's because they're equally
distant and from the same RTE. This is a contrived example, but
typically showing equally distant columns is useful when they're in a
foreign-key relationship - I was worried about the common case where a
column name is misspelled that would otherwise be ambiguous, which is
why that shows a HINT while the single RTE case doesn't. I think that
in most realistic cases it wouldn't be all that useful to show two
columns from the same table when they're equally distant. It's easy to
imagine that reflecting that no match is good in absolute terms, and
we're somewhat conservative about showing any match. While I think
this general behavior is defensible, I must admit that it did suit me
to write it that way because to do otherwise would have necessitated
more invasive code in the existing general purpose scanRTEForColumn()
function.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-07-08 Thread Peter Geoghegan
On Tue, Jul 8, 2014 at 1:55 PM, Peter Geoghegan p...@heroku.com wrote:
 I was worried about the common case where a
 column name is misspelled that would otherwise be ambiguous, which is
 why that shows a HINT while the single RTE case doesn't

To be clear - I mean a HINT with two suggestions rather than just one.
If there are 3 or more equally distant suggestions (even if they're
all from different RTEs) we also give no HINT in the proposed patch.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-07-04 Thread Abhijit Menon-Sen
At 2014-07-02 15:51:08 -0700, p...@heroku.com wrote:

 Attached revision factors in everyone's concerns here, I think.

Is anyone planning to review Peter's revised patch?

 These new measures make the coding somewhat more complex than that of
 the initial version, although overall the parser code added by this
 patch is almost entirely confined to code paths concerned only with
 producing diagnostic messages to help users.

Yes, the new patch looks quite a bit more involved than earlier, but if
that's what it takes to provide a useful HINT, I guess it's not too bad.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-07-02 Thread Peter Geoghegan
On Sun, Jun 29, 2014 at 7:30 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Although printing all candidates seems to be what's preferred by
 existing systems with similar facilities, I can see the point that
 constructing the message in a translatable fashion might be difficult.
 So personally I'd be willing to abandon insistence on that.  I still
 think though that printing candidates with very large distances
 would be unhelpful.

Attached revision factors in everyone's concerns here, I think.

I've addressed your concern about the closeness of the match proposed
in the HINT - the absolute as opposed to relative quality of the
match. There is a normalized distance threshold that must always be
exceeded to prevent ludicrous suggestions. This works along similar
lines to those sketched by Robert. Furthermore, I've made it
occasionally possible to see 2 suggestions, when they're equally
distant and when each suggestion comes from a different range table
entry. However, if the two best suggestions (overall or within an RTE)
come from within the same RTE, then that RTE is ignored for the
purposes of picking a suggestion (although the lowest observed
distance from an ignored RTE may still be used as the distance for
later RTEs to beat to get their attributes suggested in the HINT).

The idea here is that this quality-bar for suggestions doesn't come at
the cost of ignoring my concern about the presumably somewhat common
case where there is an unqualified and therefore ambiguous column
reference that happens to also be misspelled. An ambiguous column
reference and an incorrectly spelled column name are both very common,
and so it seems likely that momentary lapses where the user gets both
things wrong at once are also common. We do all this without going
overboard, since as outlined by Robert, when there are 3 or more
equally distant candidates (even if they all come from different
RTEs), we give no HINT at all. The big picture here is to make mental
context switches cheap when writing ad-hoc queries in psql.

A lot of the HINTs that popped up in the regression tests that seemed
kind of questionable no longer appear. These new measures make the
coding somewhat more complex than that of the initial version,
although overall the parser code added by this patch is almost
entirely confined to code paths concerned only with producing
diagnostic messages to help users.

-- 
Peter Geoghegan


levenshtein_column_hint.v2.2014_07_02.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-06-29 Thread Abhijit Menon-Sen
So, what's the status of this patch?

There's been quite a lot of discussion (though only about the approach;
no formal code/usage review has yet been posted), but as far as I can
tell, it just tapered off without any particular consensus.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-06-29 Thread Tom Lane
Abhijit Menon-Sen a...@2ndquadrant.com writes:
 So, what's the status of this patch?
 There's been quite a lot of discussion (though only about the approach;
 no formal code/usage review has yet been posted), but as far as I can
 tell, it just tapered off without any particular consensus.

AFAICT, people generally agree that this would probably be useful,
but there's not consensus on how far the code should be willing to
reach for a match, nor on what to do when there are multiple
roughly-equally-plausible candidates.

Although printing all candidates seems to be what's preferred by
existing systems with similar facilities, I can see the point that
constructing the message in a translatable fashion might be difficult.
So personally I'd be willing to abandon insistence on that.  I still
think though that printing candidates with very large distances
would be unhelpful.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


  1   2   >