subject:"Re\: \[HACKERS\] Doing better at HINTing an appropriate column within errorMissingColumn\(\)"

On Wed, Nov 19, 2014 at 2:33 PM, Peter Geoghegan p...@heroku.com wrote:
 I don't think that's the case. Other RTEs are penalized for having
 non-matching aliases here.

The point I made did not, as far as I can see, have anything to do
with non-matching aliases; it could arise with just a single RTE.

 In general, I think the cost of a bad suggestion is much lower than
 the benefit of a good one. You seem to be suggesting that they're
 equal. Or that they're equally likely in an organic situation. In my
 estimation, this is not the case at all.

The way I see it, the main cost of a bad suggestion is that it annoys
the user with clutter which they may brand as stupid.  Think about
how much vitriol has been spewed over the years against progress bars
(or estimated completion) times that don't turn out to mirror reality.
Microsoft has gotten more cumulative flack about their inaccurate
progress bars over the years than they would have for dropping an
elevator on a cute baby.  Or think about how annoying it is when a
spell-checker or grammar-checker underlines something you've written
that is, in your own opinion, correctly spelled or grammatical.  Maybe
that kind of thing doesn't annoy you, but it definitely annoys me, and
I think probably a lot of other people.  My general experience is that
people get quite pissed off by bad suggestions from a computer.  At
least in my experience, users' actual level of agitation is often all
out of proportion to what might seem justified, but we are designing
this software for actual users, so their likely emotional reactions
are relevant.

 I'm curious about your thoughts on the compromise of a ramped up
 distance threshold to apply a test for the absolute quality of a
 match. I think that the fact that git gives bad suggestions with terse
 strings tells us a lot, though. Note that unlike git, with terse
 strings we may well have a good deal more equidistant matches, and as
 soon as the number of would-be matches exceeds 2, we actually give no
 matches at all. So that's an additional protection against poor
 matches with terse strings.

I don't know what you mean by a ramped-up distance threshold, exactly.
I think it's good for the distance threshold to be lower for small
strings and higher for large ones.  I think I'm somewhat open to
negotiation on the details, but I think any system that's going to
suggest quantity for tit is going too far.  If the user types
qty when they meant quantity, they probably don't really need the
hint, because they're going to say to themselves wait, I guess I
didn't abbreviate that.  The time when they need the hint is when
they typed quanttiy, because it's quite possible to read a query
with that sort of typo multiple times and not realize that you've made
one.  You're sitting there puzzling over where the quantity column
went, and asking yourselves how you can be mis-remembering the schema,
and saying wait, didn't I just see that column in the \d output ...
and you don't even think to check carefully for a spelling mistake.
The hint may well clue you in to what the real problem is.

In other words, I think there's value in trying to clue somebody in
when they've made a typo, but not when they've made a think-o.  We
won't be able to do the latter accurately enough to make it more
useful than annoying.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-20 Thread Tom Lane

Robert Haas robertmh...@gmail.com writes:
 ... In other words, I think there's value in trying to clue somebody in
 when they've made a typo, but not when they've made a think-o.  We
 won't be able to do the latter accurately enough to make it more
 useful than annoying.

FWIW, I concur with Robert's analysis that wrong suggestions are likely to
be annoying.  We should be erring on the side of not making a suggestion
rather than making one that's a low-probability guess.

I'm not particularly convinced that the f1 - f2 example is a useful
behavior, and I'm downright horrified by the qty - quantity case.
If the hint mechanism thinks the latter two are close enough together
to suggest, it's going to be spewing a whole lot of utterly ridiculous
suggestions.  I'm going to be annoyed way more times than I'm going to
be helped.

The big picture is that this is more or less our first venture into
heuristic suggestions.  I think we should start slow with a very
conservative set of heuristics.  If it's a success maybe we can get more
aggressive over time --- but if we go over the top here, the entire
concept will be discredited in this community for the next ten years.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

On Thu, Nov 20, 2014 at 8:05 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 I'm not particularly convinced that the f1 - f2 example is a useful
 behavior, and I'm downright horrified by the qty - quantity case.
 If the hint mechanism thinks the latter two are close enough together
 to suggest, it's going to be spewing a whole lot of utterly ridiculous
 suggestions.  I'm going to be annoyed way more times than I'm going to
 be helped.

I happen to think that that isn't the case, because the number of
possible suggestions is fairly low anyway, and people don't tend to
make those kind of errors. Robert's examples of ridiculous
suggestions of quantity based on three letter strings other than
qty (e.g. tit) were rather contrived. In fact, most 3 letter
strings will not offer a suggestion. 3 or more Equidistant would-be
matches tend to offer a lot of additional protection against bad
suggestions for these terse strings.

 The big picture is that this is more or less our first venture into
 heuristic suggestions.  I think we should start slow with a very
 conservative set of heuristics.  If it's a success maybe we can get more
 aggressive over time --- but if we go over the top here, the entire
 concept will be discredited in this community for the next ten years.

I certainly see your point here. It's not as if we have an *evolved*
understanding of the usability issues. Besides, as Robert pointed out,
most of the value of this patch is added by simple cases, like a
failure to pluralize or not pluralize, or the omission of an
underscore.

I still think we should charge half for deletion, but I will concede
that it's prudent to apply a more restrictive absolute quality final
test.
-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

On Thu, Nov 20, 2014 at 7:32 AM, Robert Haas robertmh...@gmail.com wrote:
In general, I think the cost of a bad suggestion is much lower than
the benefit of a good one. You seem to be suggesting that they're
equal. Or that they're equally likely in an organic situation. In my
estimation, this is not the case at all.

The way I see it, the main cost of a bad suggestion is that it annoys
the user with clutter which they may brand as stupid. Think about
how much vitriol has been spewed over the years against progress bars
(or estimated completion) times that don't turn out to mirror reality.

Well, you can judge the quality of the suggestion immediately. I
imagined a mechanism that gives a little bit more than the minimum
amount of guidance for things like contractions/abbreviations.

Microsoft has gotten more cumulative flack about their inaccurate
progress bars over the years than they would have for dropping an
elevator on a cute baby.

I haven't used a more recent version of Windows than Windows Vista,
but I'm pretty sure that they kept it up.

I'm curious about your thoughts on the compromise of a ramped up
distance threshold to apply a test for the absolute quality of a
match. I think that the fact that git gives bad suggestions with terse
strings tells us a lot, though. Note that unlike git, with terse
strings we may well have a good deal more equidistant matches, and as
soon as the number of would-be matches exceeds 2, we actually give no
matches at all. So that's an additional protection against poor
matches with terse strings.

I don't know what you mean by a ramped-up distance threshold, exactly.
I think it's good for the distance threshold to be lower for small
strings and higher for large ones. I think I'm somewhat open to
negotiation on the details, but I think any system that's going to
suggest quantity for tit is going too far.

I mean the suggestion of raising the cost threshold more gradually,
not as a step function of the number of characters in the string [1]
where it's either over 6 characters and must pass the 50% test, or
isn't and has no absolute quality test. The exact modification I
described will FWIW remove the quantity for qty suggestion, as
well as all the similar suggestions that you found objectionable (like
tit also offering a suggestion of quantity).

If you look at the regression tests, none of the sensible suggestions
are lost (some would be by an across the board 50% absolute quality
threshold, as I previously pointed out [2]), but all the bad ones are.
I attach failed regression test output showing the difference between
the previous expected values, and actual values with that small
modification - it looks like most or all bad cases are now fixed.

If the user types
qty when they meant quantity, they probably don't really need the
hint, because they're going to say to themselves wait, I guess I
didn't abbreviate that. The time when they need the hint is when
they typed quanttiy, because it's quite possible to read a query
with that sort of typo multiple times and not realize that you've made
one.

I agree that that's a more important case.

In other words, I think there's value in trying to clue somebody in
when they've made a typo, but not when they've made a think-o. We
won't be able to do the latter accurately enough to make it more
useful than annoying.

That's certainly true; I think that we only disagree about the exact
point at which we enter the think-o correction business.

[1]
http://www.postgresql.org/message-id/CAM3SWZT+7hH29Go6ZuY2OrCS40=6ypvm_nt9njfovp3xwji...@mail.gmail.com
[2]
http://www.postgresql.org/message-id/cam3swztsgoknht8rk+0eed7spnjg4padmbqqyi0fh9bwcnv...@mail.gmail.com
--
Peter Geoghegan

regression.diffs
Description: Binary data

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

On Thu, Nov 20, 2014 at 1:30 PM, Peter Geoghegan p...@heroku.com wrote:
 I mean the suggestion of raising the cost threshold more gradually,
 not as a step function of the number of characters in the string [1]
 where it's either over 6 characters and must pass the 50% test, or
 isn't and has no absolute quality test. The exact modification I
 described will FWIW remove the quantity for qty suggestion, as
 well as all the similar suggestions that you found objectionable (like
 tit also offering a suggestion of quantity).

 If you look at the regression tests, none of the sensible suggestions
 are lost (some would be by an across the board 50% absolute quality
 threshold, as I previously pointed out [2]), but all the bad ones are.
 I attach failed regression test output showing the difference between
 the previous expected values, and actual values with that small
 modification - it looks like most or all bad cases are now fixed.

That does seem to give better results, but it still seems awfully
complicated.  If we just used Levenshtein with all-default cost
factors and a distance cap equal to Max(strlen(what_user_typed),
strlen(candidate_match), 3), what cases that you think are important
would be harmed?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

On Thu, Nov 20, 2014 at 11:03 AM, Robert Haas robertmh...@gmail.com wrote:
 That does seem to give better results, but it still seems awfully
 complicated.  If we just used Levenshtein with all-default cost
 factors and a distance cap equal to Max(strlen(what_user_typed),
 strlen(candidate_match), 3), what cases that you think are important
 would be harmed?

Well, just by plugging in default Levenshtein cost factors, I can see
the following regression:

*** /home/pg/postgresql/src/test/regress/expected/join.out 2014-11-20
10:17:55.042291912 -0800
--- /home/pg/postgresql/src/test/regress/results/join.out 2014-11-20
11:42:15.670108745 -0800
***
*** 3452,3458 
  ERROR:  column atts.relid does not exist
  LINE 1: select atts.relid::regclass, s.* from pg_stats s join
 ^
- HINT:  Perhaps you meant to reference the column atts.indexrelid.

Within the catalogs, the names of attributes are prefixed as a form of
what you might call internal namespacing. For example, pg_index has
attributes that all begin with ind*. You could easily omit something
like that, while still more or less knowing what you're looking for.

In more concrete terms, this gets no suggestion:

postgres=# select key from pg_index;
ERROR:  42703: column key does not exist
LINE 1: select key from pg_index;
   ^

Only this does:

postgres=# select ikey from pg_index;
ERROR:  42703: column ikey does not exist
LINE 1: select ikey from pg_index;
   ^
HINT:  Perhaps you meant to reference the column pg_index.indkey.
postgres=#

The git people varied their Levenshtein costings for a reason.

I also think that a one size fits all cap will break things. It will
independently break the example above, as well as the more marginal
c1.f2. vs c1.f2 case (okay, maybe that case was exactly on the
threshold, but others won't be).

I don't see that different costings actually saves any complexity.
Similarly, the final cap is quite straightforward. Anything with any
real complexity happens before that.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

On Thu, Nov 20, 2014 at 3:00 PM, Peter Geoghegan p...@heroku.com wrote:
 On Thu, Nov 20, 2014 at 11:03 AM, Robert Haas robertmh...@gmail.com wrote:
 That does seem to give better results, but it still seems awfully
 complicated.  If we just used Levenshtein with all-default cost
 factors and a distance cap equal to Max(strlen(what_user_typed),
 strlen(candidate_match), 3), what cases that you think are important
 would be harmed?

 Well, just by plugging in default Levenshtein cost factors, I can see
 the following regression:

 *** /home/pg/postgresql/src/test/regress/expected/join.out 2014-11-20
 10:17:55.042291912 -0800
 --- /home/pg/postgresql/src/test/regress/results/join.out 2014-11-20
 11:42:15.670108745 -0800
 ***
 *** 3452,3458 
   ERROR:  column atts.relid does not exist
   LINE 1: select atts.relid::regclass, s.* from pg_stats s join
  ^
 - HINT:  Perhaps you meant to reference the column atts.indexrelid.

 Within the catalogs, the names of attributes are prefixed as a form of
 what you might call internal namespacing. For example, pg_index has
 attributes that all begin with ind*. You could easily omit something
 like that, while still more or less knowing what you're looking for.

 In more concrete terms, this gets no suggestion:

 postgres=# select key from pg_index;
 ERROR:  42703: column key does not exist
 LINE 1: select key from pg_index;
^

 Only this does:

 postgres=# select ikey from pg_index;
 ERROR:  42703: column ikey does not exist
 LINE 1: select ikey from pg_index;
^
 HINT:  Perhaps you meant to reference the column pg_index.indkey.
 postgres=#

Seems fine to me.  If you typed relid rather than indexrelid or key
rather than indkey, that's a thinko, not a typo.  ikey for indkey
could plausible be a typo, though you'd have to be having a fairly bad
day at the keyboard.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

On Thu, Nov 20, 2014 at 12:14 PM, Robert Haas robertmh...@gmail.com wrote:
 Seems fine to me.  If you typed relid rather than indexrelid or key
 rather than indkey, that's a thinko, not a typo.  ikey for indkey
 could plausible be a typo, though you'd have to be having a fairly bad
 day at the keyboard.

I can tell that I have no chance of convincing you otherwise. While I
think you're mistaken to go against the precedent set by git, you're
the one with the commit bit, and I think we've already spent enough
time discussing this. So default costings it is.
-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-20 Thread Andres Freund

On 2014-11-20 12:00:51 -0800, Peter Geoghegan wrote:
 In more concrete terms, this gets no suggestion:
 
 postgres=# select key from pg_index;
 ERROR:  42703: column key does not exist
 LINE 1: select key from pg_index;
^

I don't think that's a bad thing. Yes, for a human those look pretty
similar, but it's easy to construct cases where that gives completely
hilarious results.

I think something simplistic like levenshtein, even with modified
distances, is good to catch typos. But not to find terms that are
related in more complex ways.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-20 Thread Tom Lane

Peter Geoghegan p...@heroku.com writes:
 On Thu, Nov 20, 2014 at 11:03 AM, Robert Haas robertmh...@gmail.com wrote:
 That does seem to give better results, but it still seems awfully
 complicated.  If we just used Levenshtein with all-default cost
 factors and a distance cap equal to Max(strlen(what_user_typed),
 strlen(candidate_match), 3), what cases that you think are important
 would be harmed?

 Well, just by plugging in default Levenshtein cost factors, I can see
 the following regression:

 *** /home/pg/postgresql/src/test/regress/expected/join.out 2014-11-20
 10:17:55.042291912 -0800
 --- /home/pg/postgresql/src/test/regress/results/join.out 2014-11-20
 11:42:15.670108745 -0800
 ***
 *** 3452,3458 
   ERROR:  column atts.relid does not exist
   LINE 1: select atts.relid::regclass, s.* from pg_stats s join
  ^
 - HINT:  Perhaps you meant to reference the column atts.indexrelid.


I do not have a problem with deciding that that is not a regression;
in fact, not giving that hint seems like a good conservative behavior
here.  By your logic, we should also be prepared to suggest
supercalifragilisticexpialidocious when the user enters ocious.
It's simply a bridge too far for what is supposed to be a hint for
minor typos.  You sound like you want to turn it into something that
will look up column names for people who are too lazy to even try to
type the right thing.  While I can see the value of such a tool within
certain contexts, firing completed queries at a live SQL engine
is not one of them.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

On Thu, Nov 20, 2014 at 12:26 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I do not have a problem with deciding that that is not a regression;
 in fact, not giving that hint seems like a good conservative behavior
 here.  By your logic, we should also be prepared to suggest
 supercalifragilisticexpialidocious when the user enters ocious.

That's clearly not true. I just want to be a bit more forgiving of
omissions. That clearly isn't my logic, since that isn't a suggestion
that the implementation will give, or would be anywhere close to
giving - my weighing of deletions is only twice that of substitutions
or insertions, not ten times. git does not use Levenshtein default
costings either.

 It's simply a bridge too far for what is supposed to be a hint for
 minor typos.

Minor typos and minor omissions. My example was on the edge of what
would be tolerable under my proposed cost model.

 You sound like you want to turn it into something that
 will look up column names for people who are too lazy to even try to
 type the right thing.  While I can see the value of such a tool within
 certain contexts, firing completed queries at a live SQL engine
 is not one of them.

It's just a hint; a convenience. Users who imagine that it takes away
the need for putting any thought into their SQL queries have bigger
problems.

Anyway, that's all that needs to be said on that, since I've already
given up on a non-default costing. Also, we have default costing, and
we always apply a 50% standard (I see no point in doing otherwise with
default costings).

Robert: Where does that leave us? What about suggestions across RTEs?
Alias costing, etc?
-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

On Thu, Nov 20, 2014 at 3:20 PM, Peter Geoghegan p...@heroku.com wrote:
 On Thu, Nov 20, 2014 at 12:14 PM, Robert Haas robertmh...@gmail.com wrote:
 Seems fine to me.  If you typed relid rather than indexrelid or key
 rather than indkey, that's a thinko, not a typo.  ikey for indkey
 could plausible be a typo, though you'd have to be having a fairly bad
 day at the keyboard.

 I can tell that I have no chance of convincing you otherwise. While I
 think you're mistaken to go against the precedent set by git, you're
 the one with the commit bit, and I think we've already spent enough
 time discussing this. So default costings it is.

I've got a few +1s, too, if you notice.

I'm willing to be outvoted, but not by a majority of one.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

On Thu, Nov 20, 2014 at 12:50 PM, Robert Haas robertmh...@gmail.com wrote:
 I've got a few +1s, too, if you notice.

Then maybe I spoke too soon.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-19 Thread Robert Haas

On Tue, Nov 18, 2014 at 8:03 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Peter Geoghegan p...@heroku.com writes:
 On Tue, Nov 18, 2014 at 3:29 PM, Robert Haas robertmh...@gmail.com wrote:
 On Mon, Nov 17, 2014 at 3:04 PM, Peter Geoghegan p...@heroku.com wrote:
 postgres=# select qty from orderlines ;
 ERROR:  42703: column qty does not exist
 HINT:  Perhaps you meant to reference the column orderlines.quantity.

 I don't buy this example, because it would give you the same hint if
 you told it you wanted to access a column called ant, or uay, or tit.
 And that's clearly ridiculous.  The reason why quantity looks like a
 reasonable suggestion for qty is because it's a conventional
 abbreviation, but an extremely high percentage of comparable cases
 won't be.

 I maintain that omission of part of the correct spelling should be
 weighed less.

 I would say that omission of the first letter should completely disqualify
 suggestions based on this heuristic; but it might make sense to weight
 omissions less after the first letter.

I think we would be well-advised not to start inventing our own
approximate matching algorithm.  Peter's suggestion boils down to a
guess that the default cost parameters for Levenshtein suck, and your
suggestion boils down to a guess that we can fix the problems with
Peter's suggestion by bolting another heuristic on top of it - and
possibly running Levenshtein twice with different sets of cost
parameters.  Ugh.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

On Wed, Nov 19, 2014 at 5:43 AM, Robert Haas robertmh...@gmail.com wrote:
 I think we would be well-advised not to start inventing our own
 approximate matching algorithm.  Peter's suggestion boils down to a
 guess that the default cost parameters for Levenshtein suck, and your
 suggestion boils down to a guess that we can fix the problems with
 Peter's suggestion by bolting another heuristic on top of it - and
 possibly running Levenshtein twice with different sets of cost
 parameters.  Ugh.

I agree.

While I am perfectly comfortable with the fact that we are guessing
here, my guesses are based on what I observed to work well with real
schemas, and simulated errors that I thought were representative of
human error. Obviously it's possible that another scheme will do
better sometimes, including for example a scheme that picks a match
entirely at random. But on average, I think that what I have here will
do better than anything else proposed so far.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-19 Thread Robert Haas

On Wed, Nov 19, 2014 at 12:33 PM, Peter Geoghegan p...@heroku.com wrote:
 On Wed, Nov 19, 2014 at 5:43 AM, Robert Haas robertmh...@gmail.com wrote:
 I think we would be well-advised not to start inventing our own
 approximate matching algorithm.  Peter's suggestion boils down to a
 guess that the default cost parameters for Levenshtein suck, and your
 suggestion boils down to a guess that we can fix the problems with
 Peter's suggestion by bolting another heuristic on top of it - and
 possibly running Levenshtein twice with different sets of cost
 parameters.  Ugh.

 I agree.

 While I am perfectly comfortable with the fact that we are guessing
 here, my guesses are based on what I observed to work well with real
 schemas, and simulated errors that I thought were representative of
 human error. Obviously it's possible that another scheme will do
 better sometimes, including for example a scheme that picks a match
 entirely at random. But on average, I think that what I have here will
 do better than anything else proposed so far.

If you agree, then I'm not being clear enough.  I don't think think
that tinkering with the Levenshtein cost factors is a good idea, and I
think it's unhelpful to suggest something when the suggestion and the
original word differ by more than 50% of the number characters in the
shorter word.  Suggesting col for oid or x for xmax, as crops
up in the regression tests with this patch applied, shows the folly of
this: the user didn't mean the other named column; rather, the user
was confused about whether a particular system column existed for that
table.

If we had a large database of examples showing what the user typed and
what they intended, we could try different algorithms against it and
see which one performs best with fewest false positives.  But if we
don't have that, we should do things that are like the things that
other people have done before.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

On Wed, Nov 19, 2014 at 9:52 AM, Robert Haas robertmh...@gmail.com wrote:
 If you agree, then I'm not being clear enough.  I don't think think
 that tinkering with the Levenshtein cost factors is a good idea, and I
 think it's unhelpful to suggest something when the suggestion and the
 original word differ by more than 50% of the number characters in the
 shorter word.

I agree - except for very short strings, where there is insufficient
information to go on.

I was talking about the difficulty of bolting something on top of
Levenshtein distance that looked at the first character. That would
not need two costings, but would require an encoding aware matching of
the first code point.

 Suggesting col for oid or x for xmax, as crops
 up in the regression tests with this patch applied, shows the folly of
 this: the user didn't mean the other named column; rather, the user
 was confused about whether a particular system column existed for that
 table.

Those are all very terse strings. What you're overlooking is what is
broken by using straight Levenshtein distance, which includes things
in the regression test that are reasonable and helpful. As I mentioned
before, requiring a greater than 50% of total string size distance
breaks this, just within the regression tests:


  ERROR:  column f1 does not exist
  LINE 1: select f1 from c1;
 ^
- HINT:  Perhaps you meant to reference the column c1.f2.


And:


  ERROR:  column atts.relid does not exist
  LINE 1: select atts.relid::regclass, s.* from pg_stats s join
 ^
- HINT:  Perhaps you meant to reference the column atts.indexrelid.


Those are really useful suggestions! And, they're much more
representative of real user error.

The downside of weighing deletion less than substitution and insertion
is much smaller than the upside. It's worth it. The downside is only
that the user gets to see the best suggestion that isn't all that good
in an absolute sense (which we have a much harder time concluding
using simple tests for short misspellings).

 If we had a large database of examples showing what the user typed and
 what they intended, we could try different algorithms against it and
 see which one performs best with fewest false positives.  But if we
 don't have that, we should do things that are like the things that
 other people have done before.

That seems totally impractical. No one has that kind of data that I'm aware of.

How about git as a kind of precedent? It is not at all conservative
about showing *some* suggestion:


$ git aa
git: 'aa' is not a git command. See 'git --help'.

Did you mean this?
am

$ git d
git: 'd' is not a git command. See 'git --help'.

Did you mean one of these?
diff
add

$ git ddd
git: 'ddd' is not a git command. See 'git --help'.

Did you mean this?
add


And why wouldn't git be? As far as its concerned, you can only have
meant one of those small number of things. Similarly, with the patch,
the number of things we can pick from is fairly limited at this stage,
since we are actually fairly far along with parse analysis.

Now, this won't give a suggestion:


$ git 
git: '' is not a git command. See 'git --help'.


So it looks like git similarly weighs deletion less than
insertion/substitution. Are its suggestions any less ridiculous than
your examples of questionable hints from the modified regression test
expected output? This is just a guidance mechanism, and at worst we'll
show the best match (on the mechanisms own terms, which isn't too
bad).

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

On Wed, Nov 19, 2014 at 10:22 AM, Peter Geoghegan p...@heroku.com wrote:
 Those are all very terse strings. What you're overlooking is what is
 broken by using straight Levenshtein distance, which includes things
 in the regression test that are reasonable and helpful. As I mentioned
 before, requiring a greater than 50% of total string size distance
 breaks this, just within the regression tests:

Maybe you'd prefer if there was a more gradual ramp-up to requiring a
distance of no greater than 50% of the string size (normalized to take
account of my non-default costings). Right now it's a step function of
the number of characters in the string - there is no absolute
quality requirement for strings of 6 or fewer requirements.
Otherwise, there is the 50% distance absolute quality test (the test
that you want to be applied generally). I think that would be better,
without being much more complicated.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

On Wed, Nov 19, 2014 at 10:33 AM, Peter Geoghegan p...@heroku.com wrote:
 there is no absolute
 quality requirement for strings of 6 or fewer requirements.

I meant 6 or fewer *characters*, obviously.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

On Wed, Nov 19, 2014 at 10:33 AM, Peter Geoghegan p...@heroku.com wrote:
 Maybe you'd prefer if there was a more gradual ramp-up to requiring a
 distance of no greater than 50% of the string size (normalized to take
 account of my non-default costings)

I made this modification:

diff --git a/src/backend/parser/parse_relation.c
b/src/backend/parser/parse_relation.c
index 40c69d7..cca075f 100644
--- a/src/backend/parser/parse_relation.c
+++ b/src/backend/parser/parse_relation.c
@@ -929,7 +929,8 @@ searchRangeTableForCol(ParseState *pstate, const
char *alias, char *colname,
 * seen when 6 deletions are required against actual attribute
name, or 3
 * insertions/substitutions.
 */
-   if (state-distance  6  state-distance  strlen(colname) / 2)
+   if ((state-distance  3  state-distance  strlen(colname)) ||
+   (state-distance  6  state-distance  strlen(colname) / 2))
{
state-rsecond = state-rfirst = NULL;
state-second = state-first = InvalidAttrNumber;


When I run the regression tests now, then all the cases that you found
objectionable in the regression tests' previous expected output
disappear, while all the cases I think are useful that were previously
removed by applying a broad 50% standard remain. While I'm not 100%
sure that this exact formulation is the best one, I think that we can
reach a compromise on this point, that allows the costing to remain
the same without offering particularly bad suggestions for short
strings.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-19 Thread Robert Haas

On Wed, Nov 19, 2014 at 1:22 PM, Peter Geoghegan p...@heroku.com wrote:
 On Wed, Nov 19, 2014 at 9:52 AM, Robert Haas robertmh...@gmail.com wrote:
 If you agree, then I'm not being clear enough.  I don't think think
 that tinkering with the Levenshtein cost factors is a good idea, and I
 think it's unhelpful to suggest something when the suggestion and the
 original word differ by more than 50% of the number characters in the
 shorter word.

 I agree - except for very short strings, where there is insufficient
 information to go on.

That's precisely the time I think it's *most* important.  In a very
long string, the threshold should be LESS than 50%.  My original
proposal was no more than 3 characters of difference, but in any
event not more than half the length of the shorter string.

 Suggesting col for oid or x for xmax, as crops
 up in the regression tests with this patch applied, shows the folly of
 this: the user didn't mean the other named column; rather, the user
 was confused about whether a particular system column existed for that
 table.

 Those are all very terse strings. What you're overlooking is what is
 broken by using straight Levenshtein distance, which includes things
 in the regression test that are reasonable and helpful. As I mentioned
 before, requiring a greater than 50% of total string size distance
 breaks this, just within the regression tests:

 
   ERROR:  column f1 does not exist
   LINE 1: select f1 from c1;
  ^
 - HINT:  Perhaps you meant to reference the column c1.f2.
 

That's exactly 50%, not more than 50%.

(I'm also on the fence about whether the hint is actually helpful in
that case, but the rule I proposed wouldn't prohibit it.)

 And:

 
   ERROR:  column atts.relid does not exist
   LINE 1: select atts.relid::regclass, s.* from pg_stats s join
  ^
 - HINT:  Perhaps you meant to reference the column atts.indexrelid.
 

 Those are really useful suggestions! And, they're much more
 representative of real user error.

That one's right at 50% too, but it's certainly more than 3 characters
of difference.  I think it's going to be pretty hard to emit a
suggestion in that case but not in a whole lot of cases that don't
make any sense.

 How about git as a kind of precedent? It is not at all conservative
 about showing *some* suggestion:

 
 $ git aa
 git: 'aa' is not a git command. See 'git --help'.

 Did you mean this?
 am

 $ git d
 git: 'd' is not a git command. See 'git --help'.

 Did you mean one of these?
 diff
 add

 $ git ddd
 git: 'ddd' is not a git command. See 'git --help'.

 Did you mean this?
 add
 

 And why wouldn't git be? As far as its concerned, you can only have
 meant one of those small number of things. Similarly, with the patch,
 the number of things we can pick from is fairly limited at this stage,
 since we are actually fairly far along with parse analysis.

I went and found the actual code git uses for this.  It's here:

https://github.com/git/git/blob/d29e9c89dbbf0876145dc88615b99308cab5f187/help.c

And the underlying Levenshtein implementation is here:

https://github.com/git/git/blob/398dd4bd039680ba98497fbedffa415a43583c16/levenshtein.c

Apparently what they're doing is charging 0 for a transposition (which
we don't have as a separate concept), 2 for a substitution, 1 for an
insertion, and 3 for a deletion, with the constraint that anything
with a total distance of more than 6 isn't considered.  And that does
overall seem to give pretty good suggestions.  However, an interesting
point about the git algorithm is that it's not hard to make it do
stupid things on short strings:

[rhaas pgsql]$ git xy
git: 'xy' is not a git command. See 'git --help'.

Did you mean one of these?
am
gc
mv
rm

Maybe they should adopt my idea of a lower cutoff for short strings.  :-)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

On Wed, Nov 19, 2014 at 11:13 AM, Robert Haas robertmh...@gmail.com wrote:
 Apparently what they're doing is charging 0 for a transposition (which
 we don't have as a separate concept), 2 for a substitution, 1 for an
 insertion, and 3 for a deletion, with the constraint that anything
 with a total distance of more than 6 isn't considered.  And that does
 overall seem to give pretty good suggestions.

The git people know that no git command is longer than 4 or 5
characters. That doesn't apply to us. I certainly would not like to
have an absolute distance test of n characters.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

On Wed, Nov 19, 2014 at 11:13 AM, Robert Haas robertmh...@gmail.com wrote:
 That's precisely the time I think it's *most* important.  In a very
 long string, the threshold should be LESS than 50%.  My original
 proposal was no more than 3 characters of difference, but in any
 event not more than half the length of the shorter string.

We can only hint based on the information given by the user. If they
give a lot of badly matching information, we have something to go on.

 That one's right at 50% too, but it's certainly more than 3 characters
 of difference.  I think it's going to be pretty hard to emit a
 suggestion in that case but not in a whole lot of cases that don't
 make any sense.

I don't think that's the case. Other RTEs are penalized for having
non-matching aliases here.

In general, I think the cost of a bad suggestion is much lower than
the benefit of a good one. You seem to be suggesting that they're
equal. Or that they're equally likely in an organic situation. In my
estimation, this is not the case at all.

I'm curious about your thoughts on the compromise of a ramped up
distance threshold to apply a test for the absolute quality of a
match. I think that the fact that git gives bad suggestions with terse
strings tells us a lot, though. Note that unlike git, with terse
strings we may well have a good deal more equidistant matches, and as
soon as the number of would-be matches exceeds 2, we actually give no
matches at all. So that's an additional protection against poor
matches with terse strings.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-19 Thread Alvaro Herrera

Robert Haas wrote:

 And the underlying Levenshtein implementation is here:
 
 https://github.com/git/git/blob/398dd4bd039680ba98497fbedffa415a43583c16/levenshtein.c
 
 Apparently what they're doing is charging 0 for a transposition (which
 we don't have as a separate concept), 2 for a substitution, 1 for an
 insertion, and 3 for a deletion, with the constraint that anything
 with a total distance of more than 6 isn't considered.

0 for a transposition, wow. I suggested adding transpositions but there
was no support for that idea.  I suggested it because I thikn it's the
most common form of typo, and charging 2 for a deletion plus 1 for an
insertion makes a single transposition mistaek count as 3, which seems
wrong -- particularly seeing the git precedent.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

On Wed, Nov 19, 2014 at 11:34 AM, Alvaro Herrera
alvhe...@2ndquadrant.com wrote:
 0 for a transposition, wow.

Again, they're optimizing for short strings (git commands) only. There
just isn't that many transposition errors possible with a 4 character
string.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-19 Thread Alvaro Herrera

Peter Geoghegan wrote:
 On Wed, Nov 19, 2014 at 11:34 AM, Alvaro Herrera
 alvhe...@2ndquadrant.com wrote:
  0 for a transposition, wow.
 
 Again, they're optimizing for short strings (git commands) only. There
 just isn't that many transposition errors possible with a 4 character
 string.

If there's logic in your statement, I can't see it.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

On Wed, Nov 19, 2014 at 11:54 AM, Alvaro Herrera
alvhe...@2ndquadrant.com wrote:
 Peter Geoghegan wrote:
 On Wed, Nov 19, 2014 at 11:34 AM, Alvaro Herrera
 alvhe...@2ndquadrant.com wrote:
  0 for a transposition, wow.

 Again, they're optimizing for short strings (git commands) only. There
 just isn't that many transposition errors possible with a 4 character
 string.

 If there's logic in your statement, I can't see it.

The point is that transposition errors should not have no cost. If git
did not have an absolute quality test of a distance of 6, which they
can only have because all git commands are terse, then you could
construct a counter example.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-19 Thread Alvaro Herrera

Peter Geoghegan wrote:
 On Wed, Nov 19, 2014 at 11:54 AM, Alvaro Herrera
 alvhe...@2ndquadrant.com wrote:
  Peter Geoghegan wrote:
  On Wed, Nov 19, 2014 at 11:34 AM, Alvaro Herrera
  alvhe...@2ndquadrant.com wrote:
   0 for a transposition, wow.
 
  Again, they're optimizing for short strings (git commands) only. There
  just isn't that many transposition errors possible with a 4 character
  string.
 
  If there's logic in your statement, I can't see it.
 
 The point is that transposition errors should not have no cost. If git
 did not have an absolute quality test of a distance of 6, which they
 can only have because all git commands are terse, then you could
 construct a counter example.

Okay.  My point is just that whatever the string length, I think we'd do
well to regard transpositions as cheap in terms of error cost.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-18 Thread Robert Haas

On Mon, Nov 17, 2014 at 3:04 PM, Peter Geoghegan p...@heroku.com wrote:
 postgres=# select qty from orderlines ;
 ERROR:  42703: column qty does not exist
 LINE 1: select qty from orderlines ;
^
 HINT:  Perhaps you meant to reference the column orderlines.quantity.
 

I don't buy this example, because it would give you the same hint if
you told it you wanted to access a column called ant, or uay, or tit.
And that's clearly ridiculous.  The reason why quantity looks like a
reasonable suggestion for qty is because it's a conventional
abbreviation, but an extremely high percentage of comparable cases
won't be.

 +/*
 + * Charge extra (for inexact matches only) when an alias was
 + * specified that differs from what might have been used to
 + * correctly qualify this RTE's closest column
 + */
 +if (wrongalias)
 +rtestate.distance += 3;

 I don't understand what situation this is catering to.  Can you
 explain?  It seems to account for a good deal of complexity.

 Two cases:

 1. Distinguishing between the case where there was an exact match to a
 column that isn't visible (i.e. the existing reason for
 errorMissingColumn() to call here), and the case where there is a
 visible column, but our alias was the wrong one. I guess that could
 live in errorMissingColumn(), but overall it's more convenient to do
 it here, so that errorMissingColumn() handles things almost uniformly
 and doesn't really have to care.

 2. For non-exact (fuzzy) matches, it seems more useful to give one
 match rather than two when the user gave an alias that matches one
 particular RTE. Consider this:

 
 postgres=# select ordersid from orders o join orderlines ol on
 o.orderid = ol.orderid;
 ERROR:  42703: column ordersid does not exist
 LINE 1: select ordersid from orders o join orderlines ol on o.orderi...
^
 HINT:  Perhaps you meant to reference the column o.orderid or the
 column ol.orderid.
 LOCATION:  errorMissingColumn, parse_relation.c:3166

 postgres=# select ol.ordersid from orders o join orderlines ol on
 o.orderid = ol.orderid;
 ERROR:  42703: column ol.ordersid does not exist
 LINE 1: select ol.ordersid from orders o join orderlines ol on o.ord...
^
 HINT:  Perhaps you meant to reference the column ol.orderid.
 LOCATION:  errorMissingColumn, parse_relation.c:3147
 

I guess I'm confused at a broader level.  If the alias is wrong, why
are we considering names in this RTE *at all*?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-18 Thread Peter Geoghegan

On Tue, Nov 18, 2014 at 3:29 PM, Robert Haas robertmh...@gmail.com wrote:
 On Mon, Nov 17, 2014 at 3:04 PM, Peter Geoghegan p...@heroku.com wrote:
 postgres=# select qty from orderlines ;
 ERROR:  42703: column qty does not exist
 LINE 1: select qty from orderlines ;
^
 HINT:  Perhaps you meant to reference the column orderlines.quantity.
 

 I don't buy this example, because it would give you the same hint if
 you told it you wanted to access a column called ant, or uay, or tit.
 And that's clearly ridiculous.  The reason why quantity looks like a
 reasonable suggestion for qty is because it's a conventional
 abbreviation, but an extremely high percentage of comparable cases
 won't be.

Is that so terrible? Yes, if those *exact* strings are tried, that'll
happen. But the vast majority of 3 letter strings will not do that
(including many 3 letter strings that include one of the letters 'q',
't' and 'y', such as qqq, ttt, and yyy). Why, in practice, would
someone even attempt those strings? I'm worried about Murphy, not
Machiavelli. That seems like a pretty important distinction here.

I maintain that omission of part of the correct spelling should be
weighed less. I am optimizing for the case where the user has a rough
idea of the structure and spelling of things - if they're typing in
random strings, or totally distinct synonyms, there is little we can
do about that. As I said, there will always be the most marginal case
that still gets a suggestion. I see no reason to hurt the common case
where we help in order to save the user from seeing a ridiculous
suggestion. I have a final test for the absolute quality of a
suggestion, but I think we could easily be too conservative about
that. At worst, our ridiculous suggestion makes apparent that the
user's incorrect spelling was itself ridiculous. With larger strings,
we can afford to be more conservative, and we are, because we have
more information to go on. Terse column names are not uncommon,
though.

 +/*
 + * Charge extra (for inexact matches only) when an alias was
 + * specified that differs from what might have been used to
 + * correctly qualify this RTE's closest column
 + */
 +if (wrongalias)
 +rtestate.distance += 3;

 I don't understand what situation this is catering to.  Can you
 explain?  It seems to account for a good deal of complexity.

 I guess I'm confused at a broader level.  If the alias is wrong, why
 are we considering names in this RTE *at all*?

Because it's a common mistake when writing ad-hoc queries. People may
forget which exact table their column comes from. We certainly want to
weigh the fact that an alias was specified, but it shouldn't totally
limit our guess to that RTE. If nothing else, the fact that there was
a much closer match from another RTE ought to result in forcing there
to be no suggestion (due to there being too many equally good
suggestions). That's because, as I said, an *absolute* test for the
quality of a match is problematic (which, again, is why I err on the
side of letting the final, absolute quality test not limit
suggestions, particularly with short strings).

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-18 Thread Tom Lane

Peter Geoghegan p...@heroku.com writes:
 On Tue, Nov 18, 2014 at 3:29 PM, Robert Haas robertmh...@gmail.com wrote:
 On Mon, Nov 17, 2014 at 3:04 PM, Peter Geoghegan p...@heroku.com wrote:
 postgres=# select qty from orderlines ;
 ERROR:  42703: column qty does not exist
 HINT:  Perhaps you meant to reference the column orderlines.quantity.

 I don't buy this example, because it would give you the same hint if
 you told it you wanted to access a column called ant, or uay, or tit.
 And that's clearly ridiculous.  The reason why quantity looks like a
 reasonable suggestion for qty is because it's a conventional
 abbreviation, but an extremely high percentage of comparable cases
 won't be.

 I maintain that omission of part of the correct spelling should be
 weighed less.

I would say that omission of the first letter should completely disqualify
suggestions based on this heuristic; but it might make sense to weight
omissions less after the first letter.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-17 Thread Robert Haas

On Sat, Nov 15, 2014 at 7:36 PM, Peter Geoghegan p...@heroku.com wrote:
 Attached patch incorporates this feedback.

 The only user-visible difference between this revision and the
 previous revision is that it's quite possible for two suggestion to
 originate from the same RTE (there is exactly one change in the
 regression test's expected output as compared to the last revision for
 this reason. The regression tests are otherwise unchanged). It's still
 not possible to see more than 2 suggestions under any circumstances,
 no matter where they might have originated from, which I think is
 appropriate -- we continue to not present any HINT in the event of 3
 or more equidistant matches.

 I think that the restructuring required to pass around a state
 variable has resulted in somewhat clearer code.

Cool!

I'm grumpy about the distanceName() function.  That seems too generic.
If we're going to keep this as it is, I suggest something like
computeRTEColumnDistance().  But see below.

On a related note, I'm also grumpy about this comment:

+/* Charge half as much per deletion as per insertion or per substitution */
+return varstr_levenshtein_less_equal(actual, len, match, match_len,
+ 2, 1, 2, max);

The purpose of a code comment is to articulate WHY we did something,
rather than simply to restate what the code quite obviously does.  I
haven't heard a compelling argument for why this should be 2, 1, 2
rather than the default 1, 1, 1; and I'm inclined to do the latter
unless you can make some very good argument for this combination of
weights.  And if you can make such an argument, then there should be
comments so that the next person to come along and look at this code
doesn't go, huh, that's whacky, and change it.

+ int location, FuzzyAttrMatchState *rtestate)

I suggest calling this fuzzystate rather than rtestate; it's not
the state of the RTE, but the state of the fuzzy matching.

Within the scanRTEForColumn block, we have a rather large chunk of
code protected by if (rtestate), which contains the only call to
distanceName().  I suggest that we move all of this logic to a
separate, static function, and merge distanceName into it.  I also
suggest testing against NULL explicitly instead of implicitly.  So
this block of code would end up as something like:

if (fuzzystate != NULL)
updateFuzzyAttrMatchState(rte, attcolname, colname, fuzzystate);

In searchRangeTableForCol, I'm fairly certain that you've changed the
behavior by adding a check for if (rte-rtekind == RTE_JOIN) before
the call to scanRTEForColumn().  Why not instead put this check into
updateFuzzyAttrMatchState?  Then you can be sure you're not changing
the behavior in any other case.

On a similar note, I think the dropped-column test should happen early
as well, probably again in updateFuzzyAttrMatchState().  There's
little point in adding a suggestion only to throw it away again.

+/*
+ * Charge extra (for inexact matches only) when an alias was
+ * specified that differs from what might have been used to
+ * correctly qualify this RTE's closest column
+ */
+if (wrongalias)
+rtestate.distance += 3;

I don't understand what situation this is catering to.  Can you
explain?  It seems to account for a good deal of complexity.

 ERROR:  column oid does not exist
 LINE 1: select oid  0, * from altwithoid;
^
+HINT:  Perhaps you meant to reference the column altwithoid.col.

That seems like a stretch.  I think I suggested before using a
distance threshold of at most 3 or half the word length, whichever is
less.  For a three-letter column name that means not suggesting
anything if more than one character is different.  What you
implemented here is close to that, yet somehow we've got a suggestion
slipping through that has 2 out of 3 characters different.  I'm not
quite sure I see how that's getting through, but I think it shouldn't.

 ERROR:  column fullname.text does not exist
 LINE 1: select fullname.text from fullname;
^
+HINT:  Perhaps you meant to reference the column fullname.last.

Same problem, only worse!  They've only got one letter of four in common.

 ERROR:  column xmin does not exist
 LINE 1: select xmin, * from fooview;
^
+HINT:  Perhaps you meant to reference the column fooview.x.

Basically the same problem again.  I think the distance threshold in
this case should be half the shorter column name, i.e. 0.

Your new test cases include no negative test cases; that is, cases
where the machinery declines to suggest a hint because of, say, 3
equally good possibilities.  They probably should have something like
that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-17 Thread Peter Geoghegan

On Mon, Nov 17, 2014 at 10:15 AM, Robert Haas robertmh...@gmail.com wrote:
 I'm grumpy about the distanceName() function.  That seems too generic.
 If we're going to keep this as it is, I suggest something like
 computeRTEColumnDistance().  But see below.

Fair point.

 On a related note, I'm also grumpy about this comment:

 +/* Charge half as much per deletion as per insertion or per substitution 
 */
 +return varstr_levenshtein_less_equal(actual, len, match, match_len,
 + 2, 1, 2, max);

 The purpose of a code comment is to articulate WHY we did something,
 rather than simply to restate what the code quite obviously does.  I
 haven't heard a compelling argument for why this should be 2, 1, 2
 rather than the default 1, 1, 1; and I'm inclined to do the latter
 unless you can make some very good argument for this combination of
 weights.  And if you can make such an argument, then there should be
 comments so that the next person to come along and look at this code
 doesn't go, huh, that's whacky, and change it.

Okay. I agree that that deserves a comment. The actual argument for
this formulation is that it just seems to work better that way. For
example:


postgres=# \d orderlines
 Table public.orderlines
   Column|   Type   | Modifiers
-+--+---
 orderlineid | integer  | not null
 orderid | integer  | not null
 prod_id | integer  | not null
 quantity| smallint | not null
 orderdate   | date | not null

postgres=# select qty from orderlines ;
ERROR:  42703: column qty does not exist
LINE 1: select qty from orderlines ;
   ^
HINT:  Perhaps you meant to reference the column orderlines.quantity.


The point is that the fact that the user supplied qty string has so
many fewer characters than what was obviously intended - quantity -
deserves to be weighed less. If you change the costing to weigh
character deletion as being equal to substitution/addition, this
example breaks. I also think it's pretty common to have noise words in
every attribute (e.g. every column in the orderlines table matches
orderlines_*), which might otherwise mess things up by overcharging
for deletion. Having extra characters in the correctly spelled column
name seems legitimately less significant to me.

Or, in other words: having actual characters from the misspelling
match the correct spelling (and having actual characters given not
fail to match) is most important. What was given by the user is more
important than what was not given but should have been, which is not
generally true for uses of Levenshtein distance. I reached this
conclusion through trying out the patch with a couple of real schemas,
and seeing what works best.

It's hard to express that idea tersely, in a comment, but I guess I'll try.

 + int location, FuzzyAttrMatchState *rtestate)

 I suggest calling this fuzzystate rather than rtestate; it's not
 the state of the RTE, but the state of the fuzzy matching.

The idea here was to differentiate this state from the overall range
table state (in general, FuzzyAttrMatchState may be one or the other).
But okay.

 Within the scanRTEForColumn block, we have a rather large chunk of
 code protected by if (rtestate), which contains the only call to
 distanceName().  I suggest that we move all of this logic to a
 separate, static function, and merge distanceName into it.  I also
 suggest testing against NULL explicitly instead of implicitly.  So
 this block of code would end up as something like:

 if (fuzzystate != NULL)
 updateFuzzyAttrMatchState(rte, attcolname, colname, fuzzystate);

Okay.

 In searchRangeTableForCol, I'm fairly certain that you've changed the
 behavior by adding a check for if (rte-rtekind == RTE_JOIN) before
 the call to scanRTEForColumn().  Why not instead put this check into
 updateFuzzyAttrMatchState?  Then you can be sure you're not changing
 the behavior in any other case.

I thought that I had avoided changing things (beyond what was
advertised as changed in relation to this most recent revision)
because I also changed things WRT multiple matches per RTE. It's
fuzzy. Anyway, yeah, I could do it there instead.

 On a similar note, I think the dropped-column test should happen early
 as well, probably again in updateFuzzyAttrMatchState().  There's
 little point in adding a suggestion only to throw it away again.

Agreed.

 +/*
 + * Charge extra (for inexact matches only) when an alias was
 + * specified that differs from what might have been used to
 + * correctly qualify this RTE's closest column
 + */
 +if (wrongalias)
 +rtestate.distance += 3;

 I don't understand what situation this is catering to.  Can you
 explain?  It seems to account for a good deal of complexity.

Two cases:

1. Distinguishing between the case where there was an exact match to a
column that isn't visible

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-15 Thread Peter Geoghegan

On Wed, Nov 12, 2014 at 12:59 PM, Robert Haas robertmh...@gmail.com wrote:
 On that topic, I think there's unanimous consensus against the design
 where equally-distant matches are treated differently based on whether
 they are in the same RTE or different RTEs.  I think you need to
 change that if you want to get anywhere with this.  On a related note,
 the use of the additional parameter AttrNumber closest[2] to
 searchRangeTableForCol() and of the additional parameters AttrNumber
 *matchedatt and int *distance to scanRTEForColumn() is less than
 self-documenting.  I suggest creating a structure called something
 like FuzzyAttrMatchState and passing a pointer to it down to both
 functions.

Attached patch incorporates this feedback.

The only user-visible difference between this revision and the
previous revision is that it's quite possible for two suggestion to
originate from the same RTE (there is exactly one change in the
regression test's expected output as compared to the last revision for
this reason. The regression tests are otherwise unchanged). It's still
not possible to see more than 2 suggestions under any circumstances,
no matter where they might have originated from, which I think is
appropriate -- we continue to not present any HINT in the event of 3
or more equidistant matches.

I think that the restructuring required to pass around a state
variable has resulted in somewhat clearer code.

-- 
Peter Geoghegan
From 0aef5253f10ebb1ee5bbcc73782eff1352c7ab84 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan p...@heroku.com
Date: Wed, 12 Nov 2014 15:31:37 -0800
Subject: [PATCH] Levenshtein distance column HINT

Add a new HINT -- a guess as to what column the user might have intended
to reference, to be shown in various contexts where an
ERRCODE_UNDEFINED_COLUMN error is raised.  The user will see this HINT
when he or she fat-fingers a column reference in an ad-hoc SQL query, or
incorrectly pluralizes or fails to pluralize a column reference, or
incorrectly omits or includes an underscore or other punctuation
character.

The HINT suggests a column in the range table with the lowest
Levenshtein distance, or the tied-for-best pair of matching columns in
the event of there being exactly two equally likely candidates (these
may come from multiple RTEs, or the same RTE).  Limiting to two the
number of cases where multiple equally likely suggestions are all
offered at once (i.e.  giving no hint when the number of equally likely
candidates exceeds two) is a measure against suggestions that are of low
quality in an absolute sense.

A further, final measure is taken against suggestions that are of low
absolute quality:  If the distance exceeds a normalized distance
threshold, no suggestion is given.
---
 src/backend/parser/parse_expr.c   |   9 +-
 src/backend/parser/parse_func.c   |   2 +-
 src/backend/parser/parse_relation.c   | 345 +++---
 src/backend/utils/adt/levenshtein.c   |   9 +
 src/include/parser/parse_relation.h   |  20 +-
 src/test/regress/expected/alter_table.out |   8 +
 src/test/regress/expected/join.out|  39 
 src/test/regress/expected/plpgsql.out |   1 +
 src/test/regress/expected/rowtypes.out|   1 +
 src/test/regress/expected/rules.out   |   1 +
 src/test/regress/expected/without_oid.out |   2 +
 src/test/regress/sql/join.sql |  24 +++
 12 files changed, 421 insertions(+), 40 deletions(-)

diff --git a/src/backend/parser/parse_expr.c b/src/backend/parser/parse_expr.c
index 4a8aaf6..a77a3a0 100644
--- a/src/backend/parser/parse_expr.c
+++ b/src/backend/parser/parse_expr.c
@@ -621,7 +621,8 @@ transformColumnRef(ParseState *pstate, ColumnRef *cref)
 colname = strVal(field2);
 
 /* Try to identify as a column of the RTE */
-node = scanRTEForColumn(pstate, rte, colname, cref-location);
+node = scanRTEForColumn(pstate, rte, colname, cref-location,
+		NULL);
 if (node == NULL)
 {
 	/* Try it as a function call on the whole row */
@@ -666,7 +667,8 @@ transformColumnRef(ParseState *pstate, ColumnRef *cref)
 colname = strVal(field3);
 
 /* Try to identify as a column of the RTE */
-node = scanRTEForColumn(pstate, rte, colname, cref-location);
+node = scanRTEForColumn(pstate, rte, colname, cref-location,
+		NULL);
 if (node == NULL)
 {
 	/* Try it as a function call on the whole row */
@@ -724,7 +726,8 @@ transformColumnRef(ParseState *pstate, ColumnRef *cref)
 colname = strVal(field4);
 
 /* Try to identify as a column of the RTE */
-node = scanRTEForColumn(pstate, rte, colname, cref-location);
+node = scanRTEForColumn(pstate, rte, colname, cref-location,
+		NULL);
 if (node == NULL)
 {
 	/* Try it as a function call on the whole row */
diff --git a/src/backend/parser/parse_func.c b/src/backend/parser/parse_func.c
index 9ebd3fd..472e15e 100644
--- a/src/backend/parser/parse_func.c
+++

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-13 Thread Robert Haas

On Wed, Nov 12, 2014 at 10:42 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 On Tue, Nov 11, 2014 at 3:52 PM, Peter Geoghegan p...@heroku.com wrote:
 I'm pretty puzzled by this. Other than our agree to disagree and
 defer to committer position on the question of whether or not more
 than one suggestion can come from a single RTE, which you were fine
 with before [1], I have only restored the core/contrib separation to a
 state recently suggested by Robert as the best and simplest all around
 [2].
 Did I miss something else?
 My point is: I am not sure I can be defined as a reviewer of this
 patch or take any credit in this patch review knowing that the latest
 version submitted is a simple rebase of the version I did my first
 review on. Hence, code speaking, this patch is in the same state as
 when it has been firstly submitted.

Of course you can.  Time spent reviewing is time spent reviewing,
whether it results in changes to the patch or not.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-13 Thread Peter Geoghegan

On Thu, Nov 13, 2014 at 8:57 AM, Robert Haas robertmh...@gmail.com wrote:
 My point is: I am not sure I can be defined as a reviewer of this
 patch or take any credit in this patch review knowing that the latest
 version submitted is a simple rebase of the version I did my first
 review on. Hence, code speaking, this patch is in the same state as
 when it has been firstly submitted.

 Of course you can.  Time spent reviewing is time spent reviewing,
 whether it results in changes to the patch or not.


My thoughts exactly. I thought Michael did a good job, even if I
didn't agree with everything he said.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-13 Thread Robert Haas

On Wed, Nov 12, 2014 at 8:00 PM, Peter Geoghegan p...@heroku.com wrote:
 On Wed, Nov 12, 2014 at 4:54 PM, Peter Geoghegan p...@heroku.com wrote:
 Attached patch moves the Levenshtein distance implementation into core.

 Oops. Somehow managed to send a *.patch.swp file.  :-)

 Here is the actual patch.

Committed.  I changed varstr_leven() to varstr_levenshtein() because
abbrvs cn mk the code hrd to undstnd.  And to grep.  And I removed the
StaticAssertStmt you added, because it's not actually used for
anything that necessitates that, yet.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-13 Thread Peter Geoghegan

On Thu, Nov 13, 2014 at 9:36 AM, Robert Haas robertmh...@gmail.com wrote:
 Committed.  I changed varstr_leven() to varstr_levenshtein() because
 abbrvs cn mk the code hrd to undstnd.  And to grep.

Thanks. I'll produce a revision of patch 2/2 soon.

 And I removed the
 StaticAssertStmt you added, because it's not actually used for
 anything that necessitates that, yet.

I'll add it back in in patch 2/2, so.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-12 Thread Robert Haas

On Sun, Nov 9, 2014 at 11:48 PM, Peter Geoghegan p...@heroku.com wrote:
 On Fri, Nov 7, 2014 at 12:57 PM, Robert Haas robertmh...@gmail.com wrote:
 Based on this review from a month ago, I'm going to mark this Waiting
 on Author.  If nobody updates the patch in a few days, I'll mark it
 Returned with Feedback.  Thanks.

 Attached revision fixes the compiler warning that Heikki complained
 about. I maintain SQL-callable stub functions from within contrib,
 rather than follow Michael's approach. In other words, very little has
 changed from my revision from July last [1].

I agree with your proposed approach to moving Levenshtein into core.
However, I think this should be separated into two patches, one of
them moving the Levenshtein functionality into core, and the other
adding the new treatment for missing column errors.  If you can do
that relatively soon, I'll make an effort to get the refactoring patch
committed in the near future.  Once that's done, we can focus in on
the interesting part of the patch, which is the actual machinery for
suggesting alternatives.

On that topic, I think there's unanimous consensus against the design
where equally-distant matches are treated differently based on whether
they are in the same RTE or different RTEs.  I think you need to
change that if you want to get anywhere with this.  On a related note,
the use of the additional parameter AttrNumber closest[2] to
searchRangeTableForCol() and of the additional parameters AttrNumber
*matchedatt and int *distance to scanRTEForColumn() is less than
self-documenting.  I suggest creating a structure called something
like FuzzyAttrMatchState and passing a pointer to it down to both
functions.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-12 Thread Peter Geoghegan

On Wed, Nov 12, 2014 at 12:59 PM, Robert Haas robertmh...@gmail.com wrote:
 I agree with your proposed approach to moving Levenshtein into core.
 However, I think this should be separated into two patches, one of
 them moving the Levenshtein functionality into core, and the other
 adding the new treatment for missing column errors.  If you can do
 that relatively soon, I'll make an effort to get the refactoring patch
 committed in the near future.  Once that's done, we can focus in on
 the interesting part of the patch, which is the actual machinery for
 suggesting alternatives.

Okay, thanks. I think I can do that fairly soon.

 On that topic, I think there's unanimous consensus against the design
 where equally-distant matches are treated differently based on whether
 they are in the same RTE or different RTEs.  I think you need to
 change that if you want to get anywhere with this.

Alright. It wasn't as if I felt very strongly about it either way.

 On a related note,
 the use of the additional parameter AttrNumber closest[2] to
 searchRangeTableForCol() and of the additional parameters AttrNumber
 *matchedatt and int *distance to scanRTEForColumn() is less than
 self-documenting.  I suggest creating a structure called something
 like FuzzyAttrMatchState and passing a pointer to it down to both
 functions.

Sure.
-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-12 Thread Peter Geoghegan

On Wed, Nov 12, 2014 at 1:13 PM, Peter Geoghegan p...@heroku.com wrote:
 On Wed, Nov 12, 2014 at 12:59 PM, Robert Haas robertmh...@gmail.com wrote:
 I agree with your proposed approach to moving Levenshtein into core.
 However, I think this should be separated into two patches, one of
 them moving the Levenshtein functionality into core, and the other
 adding the new treatment for missing column errors.  If you can do
 that relatively soon, I'll make an effort to get the refactoring patch
 committed in the near future.  Once that's done, we can focus in on
 the interesting part of the patch, which is the actual machinery for
 suggesting alternatives.

 Okay, thanks. I think I can do that fairly soon.

Attached patch moves the Levenshtein distance implementation into core.

You're missing patch 2 of 2 here, because I have yet to incorporate
your feedback on the HINT itself -- when I've done that, I'll post a
newly rebased patch 2/2, with those items taken care of. As you
pointed out, there is no reason to wait for that.

-- 
Peter Geoghegan


.0001-Move-Levenshtein-distance-implementation-into-core.patch.swp
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-12 Thread Peter Geoghegan

On Wed, Nov 12, 2014 at 4:54 PM, Peter Geoghegan p...@heroku.com wrote:
 Attached patch moves the Levenshtein distance implementation into core.

Oops. Somehow managed to send a *.patch.swp file.  :-)

Here is the actual patch.

-- 
Peter Geoghegan
From b7df918f1a52107637600f3b22d1cff18bd07ae1 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan p...@heroku.com
Date: Sat, 30 Nov 2013 23:15:00 -0800
Subject: [PATCH 1/2] Move Levenshtein distance implementation into core

The fuzzystmatch Levenshtein distance implementation is moved from
/contrib to core.  However, the related SQL-callable functions may still
only be used with the fuzzystmatch extension installed -- the
fuzzystmatch definitions become simple forwarding stubs.

It is not anticipated that the user-facing SQL functions will be moved
into core in the future, due to the MAX_LEVENSHTEIN_STRLEN restriction.
An in-core Levenshtein distance implementation is only anticipated to be
helpful in building diagnostic messages.
---
 contrib/fuzzystrmatch/Makefile|   3 -
 contrib/fuzzystrmatch/fuzzystrmatch.c |  81 ---
 contrib/fuzzystrmatch/levenshtein.c   | 403 --
 src/backend/utils/adt/Makefile|   2 +
 src/backend/utils/adt/levenshtein.c   | 394 +
 src/backend/utils/adt/varlena.c   |  24 ++
 src/include/utils/builtins.h  |   5 +
 7 files changed, 481 insertions(+), 431 deletions(-)
 delete mode 100644 contrib/fuzzystrmatch/levenshtein.c
 create mode 100644 src/backend/utils/adt/levenshtein.c

diff --git a/contrib/fuzzystrmatch/Makefile b/contrib/fuzzystrmatch/Makefile
index 024265d..0327d95 100644
--- a/contrib/fuzzystrmatch/Makefile
+++ b/contrib/fuzzystrmatch/Makefile
@@ -17,6 +17,3 @@ top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 include $(top_srcdir)/contrib/contrib-global.mk
 endif
-
-# levenshtein.c is #included by fuzzystrmatch.c
-fuzzystrmatch.o: fuzzystrmatch.c levenshtein.c
diff --git a/contrib/fuzzystrmatch/fuzzystrmatch.c b/contrib/fuzzystrmatch/fuzzystrmatch.c
index 7a53d8a..62e650f 100644
--- a/contrib/fuzzystrmatch/fuzzystrmatch.c
+++ b/contrib/fuzzystrmatch/fuzzystrmatch.c
@@ -154,23 +154,6 @@ getcode(char c)
 /* These prevent GH from becoming F */
 #define NOGHTOF(c)	(getcode(c)  16)	/* BDH */
 
-/* Faster than memcmp(), for this use case. */
-static inline bool
-rest_of_char_same(const char *s1, const char *s2, int len)
-{
-	while (len  0)
-	{
-		len--;
-		if (s1[len] != s2[len])
-			return false;
-	}
-	return true;
-}
-
-#include levenshtein.c
-#define LEVENSHTEIN_LESS_EQUAL
-#include levenshtein.c
-
 PG_FUNCTION_INFO_V1(levenshtein_with_costs);
 Datum
 levenshtein_with_costs(PG_FUNCTION_ARGS)
@@ -180,8 +163,20 @@ levenshtein_with_costs(PG_FUNCTION_ARGS)
 	int			ins_c = PG_GETARG_INT32(2);
 	int			del_c = PG_GETARG_INT32(3);
 	int			sub_c = PG_GETARG_INT32(4);
-
-	PG_RETURN_INT32(levenshtein_internal(src, dst, ins_c, del_c, sub_c));
+	const char *s_data;
+	const char *t_data;
+	int			s_bytes,
+t_bytes;
+
+	/* Extract a pointer to the actual character data */
+	s_data = VARDATA_ANY(src);
+	t_data = VARDATA_ANY(dst);
+	/* Determine length of each string in bytes and characters */
+	s_bytes = VARSIZE_ANY_EXHDR(src);
+	t_bytes = VARSIZE_ANY_EXHDR(dst);
+
+	PG_RETURN_INT32(varstr_leven(s_data, s_bytes, t_data, t_bytes, ins_c,
+ del_c, sub_c));
 }
 
 
@@ -191,8 +186,20 @@ levenshtein(PG_FUNCTION_ARGS)
 {
 	text	   *src = PG_GETARG_TEXT_PP(0);
 	text	   *dst = PG_GETARG_TEXT_PP(1);
-
-	PG_RETURN_INT32(levenshtein_internal(src, dst, 1, 1, 1));
+	const char *s_data;
+	const char *t_data;
+	int			s_bytes,
+t_bytes;
+
+	/* Extract a pointer to the actual character data */
+	s_data = VARDATA_ANY(src);
+	t_data = VARDATA_ANY(dst);
+	/* Determine length of each string in bytes and characters */
+	s_bytes = VARSIZE_ANY_EXHDR(src);
+	t_bytes = VARSIZE_ANY_EXHDR(dst);
+
+	PG_RETURN_INT32(varstr_leven(s_data, s_bytes, t_data, t_bytes, 1, 1,
+	   1));
 }
 
 
@@ -206,8 +213,20 @@ levenshtein_less_equal_with_costs(PG_FUNCTION_ARGS)
 	int			del_c = PG_GETARG_INT32(3);
 	int			sub_c = PG_GETARG_INT32(4);
 	int			max_d = PG_GETARG_INT32(5);
-
-	PG_RETURN_INT32(levenshtein_less_equal_internal(src, dst, ins_c, del_c, sub_c, max_d));
+	const char *s_data;
+	const char *t_data;
+	int			s_bytes,
+t_bytes;
+
+	/* Extract a pointer to the actual character data */
+	s_data = VARDATA_ANY(src);
+	t_data = VARDATA_ANY(dst);
+	/* Determine length of each string in bytes and characters */
+	s_bytes = VARSIZE_ANY_EXHDR(src);
+	t_bytes = VARSIZE_ANY_EXHDR(dst);
+
+	PG_RETURN_INT32(varstr_leven_less_equal(s_data, s_bytes, t_data, t_bytes,
+			ins_c, del_c, sub_c, max_d));
 }
 
 
@@ -218,8 +237,20 @@ levenshtein_less_equal(PG_FUNCTION_ARGS)
 	text	   *src = PG_GETARG_TEXT_PP(0);
 	text	   *dst = PG_GETARG_TEXT_PP(1);
 	int			max_d = PG_GETARG_INT32(2);
-
-	PG_RETURN_INT32(levenshtein_less_equal_internal(src, dst, 1, 1, 1, max_d));

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-10 Thread Michael Paquier

On Mon, Nov 10, 2014 at 1:48 PM, Peter Geoghegan p...@heroku.com wrote:

 Reminder: I maintain a slight preference for only offering one
 suggestion per relation RTE, which is what this revision does (so no
 change there). If a committer who picks this up wants me to alter
 that, I don't mind doing so; since only Michael spoke up on this, I've
 kept things my way.

Hm. The last version of this patch has not really changed since since my
first review, and I have no more feedback to provide about it except what I
already mentioned. I honestly don't think that this patch is ready for
committer as-is... If someone wants to review it further, well extra
opinions I am sure are welcome.
-- 
Michael

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-10 Thread Peter Geoghegan

On Mon, Nov 10, 2014 at 8:13 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 Hm. The last version of this patch has not really changed since since my
 first review, and I have no more feedback to provide about it except what I
 already mentioned. I honestly don't think that this patch is ready for
 committer as-is... If someone wants to review it further, well extra
 opinions I am sure are welcome.

Why not?

You've already said that you're happy to defer to whatever committer
picks this up with regard to whether or not more than a single
suggestion can come from an RTE. I agreed with this (i.e. I said I'd
defer to their opinion too), and once again drew particular attention
to this state of affairs alongside my most recent revision.

What does that leave?


-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-10 Thread Peter Geoghegan

On Mon, Nov 10, 2014 at 10:29 PM, Peter Geoghegan p...@heroku.com wrote:
 Why not?

 You've already said that you're happy to defer to whatever committer
 picks this up with regard to whether or not more than a single
 suggestion can come from an RTE. I agreed with this (i.e. I said I'd
 defer to their opinion too), and once again drew particular attention
 to this state of affairs alongside my most recent revision.

 What does that leave?

I see you've marked this Needs Review, even though your previously
marked it Ready for Committer a few months back (Robert marked it
Waiting on Author very recently because of the compiler warning, and
then I marked it back to Ready for Committer once that was
addressed, before you finally marked it back to Needs Review and
removed yourself as the reviewer just now).

I'm pretty puzzled by this. Other than our agree to disagree and
defer to committer position on the question of whether or not more
than one suggestion can come from a single RTE, which you were fine
with before [1], I have only restored the core/contrib separation to a
state recently suggested by Robert as the best and simplest all around
[2].

Did I miss something else?

[1] 
http://www.postgresql.org/message-id/CAB7nPqQObEeQ298F0Rb5+vrgex5_r=j-bvqzgp0qa1y_xdc...@mail.gmail.com
[2] 
http://www.postgresql.org/message-id/ca+tgmoykiiq8mc0uj5i5xfktybg1qqfn4yrckz60ydunumk...@mail.gmail.com
-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-09 Thread Peter Geoghegan

On Fri, Nov 7, 2014 at 12:57 PM, Robert Haas robertmh...@gmail.com wrote:
 Based on this review from a month ago, I'm going to mark this Waiting
 on Author.  If nobody updates the patch in a few days, I'll mark it
 Returned with Feedback.  Thanks.

Attached revision fixes the compiler warning that Heikki complained
about. I maintain SQL-callable stub functions from within contrib,
rather than follow Michael's approach. In other words, very little has
changed from my revision from July last [1].

Reminder: I maintain a slight preference for only offering one
suggestion per relation RTE, which is what this revision does (so no
change there). If a committer who picks this up wants me to alter
that, I don't mind doing so; since only Michael spoke up on this, I've
kept things my way.

This is not a completion mechanism; it is supposed to work on
*complete* column references with slight misspellings (e.g. incorrect
use of plurals, or column references with an omitted underscore
character). Weighing Tom's concerns about suggestions that are of
absolute low quality is what makes me conclude that this is the thing
to do.

[1] 
http://www.postgresql.org/message-id/CAM3SWZTzQO=oy4jmfb-65iefie8ihukderk-0oljetm8dsr...@mail.gmail.com
-- 
Peter Geoghegan
From 830bf9f668972ba6b531df5d4fcbd73db3472434 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan p...@heroku.com
Date: Sat, 30 Nov 2013 23:15:00 -0800
Subject: [PATCH] Levenshtein distance column HINT

Add a new HINT -- a guess as to what column the user might have intended
to reference, to be shown in various contexts where an
ERRCODE_UNDEFINED_COLUMN error is raised.  The user will see this HINT
when he or she fat-fingers a column reference in his or her ad-hoc SQL
query, or incorrectly pluralizes or fails to pluralize a column
reference.

The HINT suggests a column in the range table with the lowest
Levenshtein distance, or the tied-for-best pair of matching columns in
the event of there being exactly two equally likely candidates (iff each
candidate column comes from a separate RTE).  Limiting the cases where
multiple equally likely suggestions are all offered at once is a measure
against suggestions that are of low quality in an absolute sense.

A further, final measure is taken against suggestions that are of low
absolute quality:  If the distance exceeds a normalized distance
threshold, no suggestion is given.

The contrib Levenshtein distance implementation is moved from /contrib
to core.  However, the SQL-callable functions may only be used with the
fuzzystmatch extension installed, just as before -- the fuzzystmatch
definitions become mere forwarding stubs.
---
 contrib/fuzzystrmatch/Makefile|   3 -
 contrib/fuzzystrmatch/fuzzystrmatch.c |  81 --
 contrib/fuzzystrmatch/levenshtein.c   | 403 --
 src/backend/parser/parse_expr.c   |   9 +-
 src/backend/parser/parse_func.c   |   2 +-
 src/backend/parser/parse_relation.c   | 319 ---
 src/backend/utils/adt/Makefile|   2 +
 src/backend/utils/adt/levenshtein.c   | 393 +
 src/backend/utils/adt/varlena.c   |  25 ++
 src/include/parser/parse_relation.h   |   3 +-
 src/include/utils/builtins.h  |   5 +
 src/test/regress/expected/alter_table.out |   8 +
 src/test/regress/expected/join.out|  39 +++
 src/test/regress/expected/plpgsql.out |   1 +
 src/test/regress/expected/rowtypes.out|   1 +
 src/test/regress/expected/rules.out   |   1 +
 src/test/regress/expected/without_oid.out |   1 +
 src/test/regress/sql/join.sql |  24 ++
 18 files changed, 849 insertions(+), 471 deletions(-)
 delete mode 100644 contrib/fuzzystrmatch/levenshtein.c
 create mode 100644 src/backend/utils/adt/levenshtein.c

diff --git a/contrib/fuzzystrmatch/Makefile b/contrib/fuzzystrmatch/Makefile
index 024265d..0327d95 100644
--- a/contrib/fuzzystrmatch/Makefile
+++ b/contrib/fuzzystrmatch/Makefile
@@ -17,6 +17,3 @@ top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 include $(top_srcdir)/contrib/contrib-global.mk
 endif
-
-# levenshtein.c is #included by fuzzystrmatch.c
-fuzzystrmatch.o: fuzzystrmatch.c levenshtein.c
diff --git a/contrib/fuzzystrmatch/fuzzystrmatch.c b/contrib/fuzzystrmatch/fuzzystrmatch.c
index 7a53d8a..62e650f 100644
--- a/contrib/fuzzystrmatch/fuzzystrmatch.c
+++ b/contrib/fuzzystrmatch/fuzzystrmatch.c
@@ -154,23 +154,6 @@ getcode(char c)
 /* These prevent GH from becoming F */
 #define NOGHTOF(c)	(getcode(c)  16)	/* BDH */
 
-/* Faster than memcmp(), for this use case. */
-static inline bool
-rest_of_char_same(const char *s1, const char *s2, int len)
-{
-	while (len  0)
-	{
-		len--;
-		if (s1[len] != s2[len])
-			return false;
-	}
-	return true;
-}
-
-#include levenshtein.c
-#define LEVENSHTEIN_LESS_EQUAL
-#include levenshtein.c
-
 PG_FUNCTION_INFO_V1(levenshtein_with_costs);
 Datum
 levenshtein_with_costs(PG_FUNCTION_ARGS)
@@

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-09 Thread Michael Paquier

On Mon, Nov 10, 2014 at 1:48 PM, Peter Geoghegan p...@heroku.com wrote:

 On Fri, Nov 7, 2014 at 12:57 PM, Robert Haas robertmh...@gmail.com
 wrote:
  Based on this review from a month ago, I'm going to mark this Waiting
  on Author.  If nobody updates the patch in a few days, I'll mark it
  Returned with Feedback.  Thanks.

 Attached revision fixes the compiler warning that Heikki complained
 about. I maintain SQL-callable stub functions from within contrib,
 rather than follow Michael's approach. In other words, very little has
 changed from my revision from July last [1].

FWIW, I still find this bit of code that this patch adds in varlena.c ugly:
+#include levenshtein.c
+#define LEVENSHTEIN_LESS_EQUAL
+#include levenshtein.c
+#undef LEVENSHTEIN_LESS_EQUAL
-- 
Michael

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-09 Thread Peter Geoghegan

On Sun, Nov 9, 2014 at 8:56 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 FWIW, I still find this bit of code that this patch adds in varlena.c ugly:

 +#include levenshtein.c
 +#define LEVENSHTEIN_LESS_EQUAL
 +#include levenshtein.c
 +#undef LEVENSHTEIN_LESS_EQUAL

Okay, but this is the coding that currently appears within contrib's
fuzzystrmatch.c, more or less unchanged. The #undef
LEVENSHTEIN_LESS_EQUAL line that I added ought to be unnecessary.
I'll give the final word on that to whoever picks this up.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-07 Thread Robert Haas

On Mon, Oct 6, 2014 at 3:09 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 On 07/18/2014 10:47 AM, Michael Paquier wrote:

 On Fri, Jul 18, 2014 at 3:54 AM, Peter Geoghegan p...@heroku.com wrote:

 I am not opposed to moving the contrib code into core in the manner
 that you oppose. I don't feel strongly either way.

 I noticed in passing that your revision says this *within* levenshtein.c:

 + * Guaranteed to work with Name datatype's cstrings.
 + * For full details see levenshtein.c.

 Yeah, I looked at what I produced yesterday night again and came
 across a couple of similar things :) And reworked a couple of things
 in the version attached, mainly wordsmithing and adding comments here
 and there, as well as making the naming of the Levenshtein functions
 in core the same as the ones in fuzzystrmatch 1.0.

 I imagined that when a committer picked this up, an executive decision
 would be made one way or the other. I am quite willing to revise the
 patch to alter this behavior at the request of a committer.

 Fine for me. I'll move this patch to the next stage then.


 There are a bunch of compiler warnings:

 parse_relation.c: In function ‘errorMissingColumn’:
 parse_relation.c:3114:447: warning: ‘closestcol1’ may be used uninitialized
 in this function [-Wmaybe-uninitialized]
 parse_relation.c:3066:8: note: ‘closestcol1’ was declared here
 parse_relation.c:3129:29: warning: ‘closestcol2’ may be used uninitialized
 in this function [-Wmaybe-uninitialized]
 parse_relation.c:3067:8: note: ‘closestcol2’ was declared here
 levenshtein.c: In function ‘levenshtein_common’:
 levenshtein.c:107:6: warning: unused variable ‘start_column_local’
 [-Wunused-variable]

Based on this review from a month ago, I'm going to mark this Waiting
on Author.  If nobody updates the patch in a few days, I'll mark it
Returned with Feedback.  Thanks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-10-06 Thread Heikki Linnakangas


On 07/18/2014 10:47 AM, Michael Paquier wrote:

On Fri, Jul 18, 2014 at 3:54 AM, Peter Geoghegan p...@heroku.com wrote:

I am not opposed to moving the contrib code into core in the manner
that you oppose. I don't feel strongly either way.

I noticed in passing that your revision says this *within* levenshtein.c:

+ * Guaranteed to work with Name datatype's cstrings.
+ * For full details see levenshtein.c.

Yeah, I looked at what I produced yesterday night again and came
across a couple of similar things :) And reworked a couple of things
in the version attached, mainly wordsmithing and adding comments here
and there, as well as making the naming of the Levenshtein functions
in core the same as the ones in fuzzystrmatch 1.0.


I imagined that when a committer picked this up, an executive decision
would be made one way or the other. I am quite willing to revise the
patch to alter this behavior at the request of a committer.

Fine for me. I'll move this patch to the next stage then.


There are a bunch of compiler warnings:

parse_relation.c: In function ‘errorMissingColumn’:
parse_relation.c:3114:447: warning: ‘closestcol1’ may be used 
uninitialized in this function [-Wmaybe-uninitialized]

parse_relation.c:3066:8: note: ‘closestcol1’ was declared here
parse_relation.c:3129:29: warning: ‘closestcol2’ may be used 
uninitialized in this function [-Wmaybe-uninitialized]

parse_relation.c:3067:8: note: ‘closestcol2’ was declared here
levenshtein.c: In function ‘levenshtein_common’:
levenshtein.c:107:6: warning: unused variable ‘start_column_local’ 
[-Wunused-variable]


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-07-23 Thread Robert Haas

On Thu, Jul 17, 2014 at 9:34 AM, Michael Paquier
michael.paqu...@gmail.com wrote:
 Patch 1 does a couple of things:
 - fuzzystrmatch is dumped to 1.1, as Levenshtein functions are not part of
 it anymore, and moved to core.
 - Removal of the LESS_EQUAL flag that made the original submission patch
 harder to understand. All the Levenshtein functions wrap a single common
 function.
 - Documentation is moved, and regression tests for Levenshtein functions are
 added.
 - Functions with costs are renamed with a suffix with costs.
 After hacking this feature, I came up with the conclusion that it would be
 better for the user experience to move directly into backend code all the
 Levenshtein functions, instead of only moving in the common wrapper as Peter
 did in his original patches. This is done this way to avoid keeping portions
 of the same feature in two different places of the code (backend with common
 routine, fuzzystrmatch with levenshtein functions) and concentrate all the
 logic in a single place. Now, we may as well consider renaming the
 levenshtein functions into smarter names, like str_distance, and keep
 fuzzystrmatch to 1.0, having the functions levenshteing_* calling only the
 str_distance functions.

This is not cool.  Anyone who is running a 9.4 or earlier database
using fuzzystrmatch and upgrades, either via dump-and-restore or
pg_upgrade, to a version with this patch applied will have a broken
database.  They will still have the catalog entries for the 1.0
definitions, but those definitions won't be resolvable inside the new
cluster's .so file.  The user will get a fairly-unfriendly error
message that won't go away until they upgrade the extension, which may
involve dealing with dependency hell since the new definitions are in
a different place than the old definitions, and there may be
dependencies on the old definitions.  One of the great advantages of
extension packaging is that this kind of problem is quite easily
avoidable, so let's avoid it.

There are several possible methods of doing that, but I think the best
one is just to leave the SQL-callable C functions in fuzzystrmatch and
move only the underlying code that supports into core.  Then, the
whole thing will be completely transparent to users.  They won't need
to upgrade their fuzzystrmatch definitions at all, and everything will
just work; under the covers, the fuzzystrmatch code will now be
calling into core code rather than to code located in that same
module, but the user doesn't need to know or care about that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-07-23 Thread Tom Lane

Robert Haas robertmh...@gmail.com writes:
 There are several possible methods of doing that, but I think the best
 one is just to leave the SQL-callable C functions in fuzzystrmatch and
 move only the underlying code that supports into core.

I hadn't been paying close attention to this thread, but I'd just assumed
that that would be the approach.

It might be worth introducing new differently-named pg_proc entries for
the same functions in core, but only if we can agree that there are better
names for them than what the extension uses.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-07-23 Thread Peter Geoghegan

On Wed, Jul 23, 2014 at 8:57 AM, Robert Haas robertmh...@gmail.com wrote:
 There are several possible methods of doing that, but I think the best
 one is just to leave the SQL-callable C functions in fuzzystrmatch and
 move only the underlying code that supports into core.

For some reason I thought that that was what Michael was proposing - a
more comprehensive move of code into core than the structuring that I
proposed. I actually thought about a Levenshtein distance operator at
one point months ago, before I entirely gave up on that. The
MAX_LEVENSHTEIN_STRLEN limitation made me think that the Levenshtein
distance functions are not suitable for core as is (although that
doesn't matter for my purposes, since all I need is something that
accommodates NAMEDATALEN sized strings). MAX_LEVENSHTEIN_STRLEN is a
considerable limitation for an in-core feature. I didn't get around to
forming an opinion on how and if that should be fixed.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-07-23 Thread Alvaro Herrera

Peter Geoghegan wrote:

 For some reason I thought that that was what Michael was proposing - a
 more comprehensive move of code into core than the structuring that I
 proposed. I actually thought about a Levenshtein distance operator at
 one point months ago, before I entirely gave up on that. The
 MAX_LEVENSHTEIN_STRLEN limitation made me think that the Levenshtein
 distance functions are not suitable for core as is (although that
 doesn't matter for my purposes, since all I need is something that
 accommodates NAMEDATALEN sized strings). MAX_LEVENSHTEIN_STRLEN is a
 considerable limitation for an in-core feature. I didn't get around to
 forming an opinion on how and if that should be fixed.

I had two thoughts:

1. Should we consider making levenshtein available to frontend programs
as well as backend?
2. Would it provide better matching to use Damerau-Levenshtein[1] instead
of raw Levenshtein?

.oO(Would anyone be so bold as to attempt to implement bitap[2] using
bitmapsets ...)

[1] http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance
[2] http://en.wikipedia.org/wiki/Bitap_algorithm

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-07-23 Thread Peter Geoghegan

On Wed, Jul 23, 2014 at 1:10 PM, Alvaro Herrera
alvhe...@2ndquadrant.com wrote:
 I had two thoughts:

 1. Should we consider making levenshtein available to frontend programs
 as well as backend?

I don't think so. Why would that be useful?

 2. Would it provide better matching to use Damerau-Levenshtein[1] instead
 of raw Levenshtein?

Maybe that would be marginally better than classic Levenshtein
distance, but I doubt it would pay for itself. It's just more code to
maintain. Are we really expecting to not get the best possible
suggestion due to some number of transposition errors very frequently?
You still have to have a worse suggestion spuriously get ahead of
yours, and typically there just aren't that many to begin with. I'm
not targeting spelling errors so much as thinkos around plurals and
whether or not an underscore was used. Damerau-Levenshtein seems like
an algorithm with fairly specialized applications.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-07-23 Thread Michael Paquier

On Thu, Jul 24, 2014 at 1:09 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 Robert Haas robertmh...@gmail.com writes:
  There are several possible methods of doing that, but I think the best
  one is just to leave the SQL-callable C functions in fuzzystrmatch and
  move only the underlying code that supports into core.

 I hadn't been paying close attention to this thread, but I'd just assumed
 that that would be the approach.

 It might be worth introducing new differently-named pg_proc entries for
 the same functions in core, but only if we can agree that there are better
 names for them than what the extension uses.

Yes, that's a point I raised upthread as well. What about renaming those
functions as string_distance and string_distance_less_than? Then have only
fuzzystrmatch do some DirectFunctionCall using the in-core functions?
-- 
Michael

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-07-23 Thread Alvaro Herrera

Peter Geoghegan wrote:

 Maybe that would be marginally better than classic Levenshtein
 distance, but I doubt it would pay for itself. It's just more code to
 maintain. Are we really expecting to not get the best possible
 suggestion due to some number of transposition errors very frequently?
 You still have to have a worse suggestion spuriously get ahead of
 yours, and typically there just aren't that many to begin with. I'm
 not targeting spelling errors so much as thinkos around plurals and
 whether or not an underscore was used. Damerau-Levenshtein seems like
 an algorithm with fairly specialized applications.

Yes, it's for typos.  I guess it's an unfrequent scenario to have both a
typoed column and a column that's missing the plural declension, which
is the case in which Damerau-Lvsh would be a win.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-07-17 Thread Peter Geoghegan

I am not opposed to moving the contrib code into core in the manner
that you oppose. I don't feel strongly either way.

I noticed in passing that your revision says this *within* levenshtein.c:

+ * Guaranteed to work with Name datatype's cstrings.
+ * For full details see levenshtein.c.

On Thu, Jul 17, 2014 at 6:34 AM, Michael Paquier
michael.paqu...@gmail.com wrote:
Patch 2 is a rebase of the feature of Peter that can be applied on top of
patch 1. The code is rather untouched (haven't much played with Peter's
thingies), well-commented, but I think that this needs more work,
particularly when a query has a single RTE like in this case where no hints
are proposed to the user (mentioned upthread):

The only source of disagreement that I am aware of at this point is
the question of whether or not we should accept two candidates from
the same RTE. I lean slightly towards no, as already explained [1]
[2], but it's not as if I feel that strongly either way - this
approach of looking for only a single best candidate per RTE taken in
deference to the concerns of others.

I imagined that when a committer picked this up, an executive decision
would be made one way or the other. I am quite willing to revise the
patch to alter this behavior at the request of a committer.

[1]
http://www.postgresql.org/message-id/CAM3SWZTrm4PmqMmL9=eyx-8f-vx-ha7dme4koms2vcomozg...@mail.gmail.com
[2]
http://www.postgresql.org/message-id/cam3swzs6kiqeqjz4pv3fkp6cgw1ws26exoqtjb_xmw3ze5b...@mail.gmail.com
--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-07-09 Thread Michael Paquier

On Wed, Jul 9, 2014 at 5:56 AM, Peter Geoghegan p...@heroku.com wrote:

 On Tue, Jul 8, 2014 at 1:55 PM, Peter Geoghegan p...@heroku.com wrote:
  I was worried about the common case where a
  column name is misspelled that would otherwise be ambiguous, which is
  why that shows a HINT while the single RTE case doesn't

 To be clear - I mean a HINT with two suggestions rather than just one.
 If there are 3 or more equally distant suggestions (even if they're
 all from different RTEs) we also give no HINT in the proposed patch.

Showing up to 2 hints is fine as it does not pollute the error output with
perhaps unnecessary messages. That's even more protective than for example
git that prints all the equidistant candidates. However I can't understand
why it does not show up hints even if there are two equidistant candidates
from the same RTE. I think it should.
-- 
Michael

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-07-09 Thread Michael Paquier

On Wed, Jul 9, 2014 at 1:49 AM, Peter Geoghegan p...@heroku.com wrote:

  4) This is not nice, could it be possible to remove the stuff from
 varlena.c?
  +/* Expand each Levenshtein distance variant */
  +#include levenshtein.c
  +#define LEVENSHTEIN_LESS_EQUAL
  +#include levenshtein.c
  +#undef LEVENSHTEIN_LESS_EQUAL
  Part of the same comment: only varstr_leven_less_equal is used to
  calculate the distance, should we really move varstr_leven to core?
  This clearly needs to be reworked as not just a copy-paste of the
  things in fuzzystrmatch.
  The flag LEVENSHTEIN_LESS_EQUAL should be let within fuzzystrmatch I
 think.

 So there'd be one variant within core and one within
 contrib/fuzzystrmatch? I don't think that's an improvement.

No. The main difference between varstr_leven_less_equal and varstr_leven is
the use of the extra argument max_d in the former. My argument here is
instead of blindly cut-pasting into core the code you are interested in to
evaluate the string distances, is to refactor it to have a unique function,
and to let the business with LEVENSHTEIN_LESS_EQUAL within
contrib/fuzzystrmatch. This will require some reshuffling of the distance
function, but by looking at this patch I am getting the feeling that this
is necessary, and should even be split into a first patch for fuzzystrmatch
that would facilitate its integration into core.
Also why is rest_of_char_same within varlena.c?

 5) Do we want hints on system columns as well?
 I think it's obvious that the answer must be no. That's going to
 frequently result in suggestions of columns that users will complain
 aren't even there. If you know about the system columns, you can just
 get it right. They're supposed to be hidden for most purposes.

This may sound ridiculous, but I have already found myself mistyping ctid
by tid and cid while working on patches and modules that played with page
format, and needing a couple of minutes to understand what was going on
(bad morning). I would have welcomed such hints in those cases.
-- 
Michael

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

On Tue, Jul 8, 2014 at 11:10 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 Showing up to 2 hints is fine as it does not pollute the error output with
 perhaps unnecessary messages. That's even more protective than for example
 git that prints all the equidistant candidates. However I can't understand
 why it does not show up hints even if there are two equidistant candidates
 from the same RTE. I think it should.

Everyone is going to have an opinion on something like that. I was
showing deference to the general concern about the absolute (as
opposed to relative) quality of the HINTs in the event of equidistant
matches by having no two suggestions come from within a single RTE,
while still covering the case I thought was important by having two
suggestions if there were two equidistant matches across RTEs. I think
that's marginally better then what you propose, because your case
deals with two equidistant though distinct columns, bringing into
question the validity of both would-be suggestions. I'll defer to
whatever the consensus is.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

On Tue, Jul 8, 2014 at 11:25 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 So there'd be one variant within core and one within
 contrib/fuzzystrmatch? I don't think that's an improvement.

 No. The main difference between varstr_leven_less_equal and varstr_leven is
 the use of the extra argument max_d in the former. My argument here is
 instead of blindly cut-pasting into core the code you are interested in to
 evaluate the string distances, is to refactor it to have a unique function,
 and to let the business with LEVENSHTEIN_LESS_EQUAL within
 contrib/fuzzystrmatch. This will require some reshuffling of the distance
 function, but by looking at this patch I am getting the feeling that this is
 necessary, and should even be split into a first patch for fuzzystrmatch
 that would facilitate its integration into core.
 Also why is rest_of_char_same within varlena.c?

Just as before, rest_of_char_same() exists for the express purpose of
being called by the two variants varstr_leven_less_equal() and
varstr_leven(). Why wouldn't I copy it over too along with those two?
Where do you propose to put it?

Obviously the existing macro hacks (that I haven't changed) that build
the two variants are not terribly pretty, but they're not arbitrary
either. They reflect the fact that there is no natural way to add
callbacks or something like that. If you pretended that the core code
didn't have to care about one case or the other, and that contrib was
somehow obligated to hook in its own handler for the
!LEVENSHTEIN_LESS_EQUAL case that it now only cares about, then you'd
end up with an even bigger mess. Besides, with the patch the core code
is calling varstr_leven_less_equal(), which is the bigger of the two
variants - it's the LEVENSHTEIN_LESS_EQUAL case, not the
!LEVENSHTEIN_LESS_EQUAL case that core cares about for the purposes of
building HINTs. In short, I don't know what you mean. What would that
reshuffling actually look like?

  5) Do we want hints on system columns as well?
 I think it's obvious that the answer must be no. That's going to
 frequently result in suggestions of columns that users will complain
 aren't even there. If you know about the system columns, you can just
 get it right. They're supposed to be hidden for most purposes.

 This may sound ridiculous, but I have already found myself mistyping ctid by
 tid and cid while working on patches and modules that played with page
 format, and needing a couple of minutes to understand what was going on (bad
 morning).

I think that it's clearly not worth it, even if it is true that a
minority sometimes make this mistake. Most users don't know that there
are system columns. It's not even close to being worth it to bring
that into this.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-07-09 Thread Alvaro Herrera

Peter Geoghegan wrote:
 On Tue, Jul 8, 2014 at 11:25 PM, Michael Paquier
 michael.paqu...@gmail.com wrote:

   5) Do we want hints on system columns as well?
  I think it's obvious that the answer must be no. That's going to
  frequently result in suggestions of columns that users will complain
  aren't even there. If you know about the system columns, you can just
  get it right. They're supposed to be hidden for most purposes.
 
  This may sound ridiculous, but I have already found myself mistyping ctid by
  tid and cid while working on patches and modules that played with page
  format, and needing a couple of minutes to understand what was going on (bad
  morning).
 
 I think that it's clearly not worth it, even if it is true that a
 minority sometimes make this mistake. Most users don't know that there
 are system columns. It's not even close to being worth it to bring
 that into this.

I agree with Peter.  This is targeted at regular users.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-07-09 Thread Robert Haas

On Wed, Jul 9, 2014 at 2:10 AM, Michael Paquier
michael.paqu...@gmail.com wrote:
 On Wed, Jul 9, 2014 at 5:56 AM, Peter Geoghegan p...@heroku.com wrote:
 On Tue, Jul 8, 2014 at 1:55 PM, Peter Geoghegan p...@heroku.com wrote:
  I was worried about the common case where a
  column name is misspelled that would otherwise be ambiguous, which is
  why that shows a HINT while the single RTE case doesn't

 To be clear - I mean a HINT with two suggestions rather than just one.
 If there are 3 or more equally distant suggestions (even if they're
 all from different RTEs) we also give no HINT in the proposed patch.

 Showing up to 2 hints is fine as it does not pollute the error output with
 perhaps unnecessary messages. That's even more protective than for example
 git that prints all the equidistant candidates. However I can't understand
 why it does not show up hints even if there are two equidistant candidates
 from the same RTE. I think it should.

Me, too.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-07-09 Thread Greg Stark

On Wed, Jul 9, 2014 at 7:29 AM, Peter Geoghegan p...@heroku.com wrote:
 Everyone is going to have an opinion on something like that. I was
 showing deference to the general concern about the absolute (as
 opposed to relative) quality of the HINTs in the event of equidistant
 matches by having no two suggestions come from within a single RTE,
 while still covering the case I thought was important by having two
 suggestions if there were two equidistant matches across RTEs. I think
 that's marginally better then what you propose, because your case
 deals with two equidistant though distinct columns, bringing into
 question the validity of both would-be suggestions. I'll defer to
 whatever the consensus is.

I agree this is bike shedding. But as long as we're bike shedding...

A simple rule is easier for users to understand as well as to code. I
would humbly suggest the following: take all the unqualified column
names, downcase them, check which ones match most closely the
unmatched column. Show the top 3 matches if they're within some
arbitrary distance. If they match exactly except for the case and the
unmatched column is all lower case add a comment that quoting is
required due to the mixed case.

Honestly the current logic and the previous logic both seemed
reasonable to me. They're not going to be perfect in every case so
anything that comes up some some suggestions is fine.

-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

On Wed, Jul 9, 2014 at 8:08 AM, Greg Stark st...@mit.edu wrote:
 A simple rule is easier for users to understand as well as to code. I
 would humbly suggest the following: take all the unqualified column
 names, downcase them, check which ones match most closely the
 unmatched column. Show the top 3 matches if they're within some
 arbitrary distance.

That's harder than it sounds. You need even more translatable strings
for variant ereports(). I don't think that an easy to understand rule
is necessarily of much value - I'm already charging half price for
deletion because I found representative errors more useful in certain
cases by doing so. I think we want something that displays the most
useful suggestion as often as is practically possible, and does not
display unhelpful suggestions to the extent that it's practical to
avoid them. Plus, as I mentioned, I'm keen to avoid adding more stuff
to scanRTEForColumn() than I already have.

 Honestly the current logic and the previous logic both seemed
 reasonable to me. They're not going to be perfect in every case so
 anything that comes up some some suggestions is fine.

I think that the most recent revision is somewhat better due to the
feedback of Tom and Robert. I didn't feel as strongly as they did
about erring on the side of not showing a HINT, but I think the most
recent revision is a good compromise. But yes, at this point we're
certainly chasing diminishing returns. There are almost any number of
variants of this basic idea that could be suggested.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

On Wed, Jul 9, 2014 at 8:06 AM, Robert Haas robertmh...@gmail.com wrote:
 Showing up to 2 hints is fine as it does not pollute the error output with
 perhaps unnecessary messages. That's even more protective than for example
 git that prints all the equidistant candidates. However I can't understand
 why it does not show up hints even if there are two equidistant candidates
 from the same RTE. I think it should.

 Me, too.

The idea is that each RTE gets one best suggestion, because if there
are two best suggestions within an RTE they're probably both wrong.
Whereas across RTEs, it's probably just that there is a foreign key
relationship between the two (and the user accidentally failed to
qualify the particular column of interest on top of the misspelling, a
qualification that would be sufficient to have the code prefer the
qualified-but-misspelled column). Clearly if I was to do what you
suggest it would be closer to a wild guess, and Tom has expressed
concerns about that.

Now, I don't actually ensure that the column names of the two columns
(each from separate RTEs) are identical save for their would-be alias,
but that's just a consequence of the implementation. Also, as I've
mentioned, I don't want to put more stuff in scanRTEForColumn() than I
already have, due to your earlier concern about adding clutter.

I think we're splitting hairs at this point, and frankly I'll do it
that way if it gets the patch closer to being committed. While I
thought it was important to get the unqualified and misspelled case
right (which I did in the first revision, but perhaps at the expense
of Tom's concern about absolute suggestion quality), I don't feel
strongly about this detail either way.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-07-09 Thread Alvaro Herrera

Peter Geoghegan wrote:
 On Wed, Jul 9, 2014 at 8:08 AM, Greg Stark st...@mit.edu wrote:
  A simple rule is easier for users to understand as well as to code. I
  would humbly suggest the following: take all the unqualified column
  names, downcase them, check which ones match most closely the
  unmatched column. Show the top 3 matches if they're within some
  arbitrary distance.
 
 That's harder than it sounds. You need even more translatable strings
 for variant ereports().

Maybe it is possible to rephrase the message so that the translatable
part doesn't need to concern with how many suggestions there are.  For
instance something like perhaps you meant a name from the following
list: foo, bar, baz.  Couple with the errmsg_plural stuff, you then
don't need to worry too much about providing different strings for 1, 2,
N suggestions.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

On Wed, Jul 9, 2014 at 2:19 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote:
 That's harder than it sounds. You need even more translatable strings
 for variant ereports().

 Maybe it is possible to rephrase the message so that the translatable
 part doesn't need to concern with how many suggestions there are.  For
 instance something like perhaps you meant a name from the following
 list: foo, bar, baz.  Couple with the errmsg_plural stuff, you then
 don't need to worry too much about providing different strings for 1, 2,
 N suggestions.

That's not really the problem. I already have a lot of things to test
in each of the two ereport() calls. More importantly, showing the
closet, say, 3 matches under an arbitrary distance does not weigh
concerns about that indicating that they're all bad. It's not like
bash tab completion - if there is one best match, that's probably
because that's what the user meant. Whereas if there are two or more
within a single RTE, that's probably because both are unhelpful. They
both happened to require the same number of substitutions to get to,
while not being quite bad enough matches to be excluded by the final
check against a normalized distance threshold (the final check that
prevents ludicrous suggestions).

The fact that there were multiple equally plausible candidates (that
are not identically named and just from different RTEs) tells us
plenty, unlike with tab completion. It's not hard for one column to be
a better match than another, and so it doesn't seem unreasonable to
insist upon that within a single RTE where they cannot be identical,
since a conservative approach seems to be what is generally favored.
In any case I'm just trying to weigh everyone's concerns here. I hope
it's actually possible to compromise, but right now I don't know what
I can do to make useful progress.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()