Re: [HACKERS] pass-through queries to foreign servers

2013-08-10 Thread David Gudeman
On Mon, Aug 5, 2013 at 12:21 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 David Gudeman dave.gude...@gmail.com writes:
 For those who don't want to go to the link to see what I'm talking
 about with query rewrites, I thought I'd give a brief description.
 Foreign data wrappers currently do all of their work in the planning
 phase but I claim that isn't the right place to optimize foreign
 queries with aggregates and GROUP BY because optimizing those things
 would involve collapsing multiple plan node back into a single node
 for a foreign call.

 I'm not sure what the best implementation for that is, but what you
 propose here would still involve such collapsing, so this argument
 seems rather empty.

 I propose to do these optimizations as query
 rewrites instead. So for example suppose t is a foreign table on the
 foreign server named fs. Then the query

   SELECT count(*) FROM t

 is rewritten to

   SELECT count FROM fs('select count(*) from t') fs(count bigint)

 where ts() is the pass-through query function for the server fs. To
 implement this optimization as a query rewrite, all of the elements of
 the result have to be real source-language constructs so the
 pass-through query has to be available in Postgresql SQL.

 I don't believe in any part of that design, starting with the pass
 through query function.  For one thing, it seems narrowly targeted to the
 assumption that the FDW is a frontend for a foreign server that speaks
 SQL.  If the FDW's infrastructure doesn't include some kind of textual
 query language, this isn't going to be useful for it at all.  For another,
 a query rewrite system is unlikely to be able to cost out the alternatives
 and decide whether pushing the aggregation across is actually a win or
 not.

 The direction I think we ought to be heading is to generate explicit Paths
 representing the various ways in which aggregation can be implemented.
 The logic in grouping_planner is already overly complex, and hard to
 extend, because it's all hard-wired comparisons of alternatives.  We'd be
 better off with something more like the add_path infrastructure.  Once
 that's been done, maybe we can allow FDWs to add Paths representing remote
 aggregation.

 regards, tom lane

Well, I will have to defer to your greater knowledge of the Posgres
planner. I'm surprised that it can do that since in my experience, the
planner did a cost-based search to find the best join order but the
rest of the plan structure was generated independently of the tables.
If the Postgres planner can look at the FROM clause and generate a
completely different structure based on the types of tables it finds
there, then that doesn't sound unreasonable.

However it still seems a bit like overkill. The purpose of a
cost-based heuristic search is to deal with the fact that there are
lots and lots of possible join orders (as the number of tables grows)
and it makes a huge difference which order you chose. By contrast,
this decision is fairly easy. Either you send the computation to the
data or you bring the data to the computation. Since in the case of
SQL aggregation the computation generally reduces the amount of data
that has to be  moved, the first choice is almost always the right
one.

Also, this means that you won't be able to use the code that I've
written since it's all based on query rewriting. I'm disappointed
because I wanted to contribute something back, but that's for you guys
to decide.

Regards,
David Gudeman
http://unobtainabol.blogspot.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pass-through queries to foreign servers

2013-08-05 Thread David Gudeman
For those who don't want to go to the link to see what I'm talking
about with query rewrites, I thought I'd give a brief description.
Foreign data wrappers currently do all of their work in the planning
phase but I claim that isn't the right place to optimize foreign
queries with aggregates and GROUP BY because optimizing those things
would involve collapsing multiple plan node back into a single node
for a foreign call. I propose to do these optimizations as query
rewrites instead. So for example suppose t is a foreign table on the
foreign server named fs. Then the query

  SELECT count(*) FROM t

is rewritten to

  SELECT count FROM fs('select count(*) from t') fs(count bigint)

where ts() is the pass-through query function for the server fs. To
implement this optimization as a query rewrite, all of the elements of
the result have to be real source-language constructs so the
pass-through query has to be available in Postgresql SQL.

My current implementation of this uses a plugin that hooks into
planner_hook, but I'm hoping that I can get some support for adding
the query rewriting as callback functions for the FDW system.

Regards,
David Gudeman
http://unobtainabol.blogspot.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pass-through queries to foreign servers

2013-08-01 Thread David Gudeman
On Tue, Jul 30, 2013 at 10:22 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 David Fetter da...@fetter.org writes:
 On Tue, Jul 30, 2013 at 04:40:38PM -0700, David Gudeman wrote:
 When you write an application involving foreign tables, you frequently
 end up with queries that are just too inefficient because they bring
 too much data over from the foreign server. For a trivial example,
 consider SELECT count(*) FROM t where t is a foreign table. This
 will pull the entire table over the network just to count up the rows.

 Yes, and this case is a known limitation of our planner
 infrastructure.   Aggregates are special when it comes to
 generating paths for the planner to evaluate, so there's no current
 way a FDW could supply such info to the planner, and hence no API in
 our FDW code for having FDWs supply that info.  That's probably a
 should fix but I don't know whether a project that size could be
 done by 9.4.

 Yeah.  There's a lot left to be done in the FDW infrastructure.
 But not this:

 All that said, my DBI-Link, back in the bad old days, provided two
 important functions: remote_select(), which returned SETOF RECORD and
 remote_execute(), which returned nothing.  It also provided ways to
 control connections to the remote host, introspect remote schemas,
 etc., etc.  We need capabilities like that in the FDW API, I believe
 we could have them by 9.4.

 I would argue we *don't* want that.  If you want pass-through queries
 or explicit connection control, your needs are already met by dblink or
 dbi-link.  The whole point of FDW is that it's at a higher level of
 abstraction than that; which offers greater ease of use and will
 eventually offer better optimization than what you can get from dblink
 et al.  If we start trying to shoehorn things like passthrough queries
 into FDW, we'll be crippling the technology.  As an example, somebody
 on planet postgresql was just recently surprised to find that postgres_fdw
 honors transaction rollback.  Well, it can do that because users can't
 disconnect the connection underneath it, nor issue passthrough
 commit/rollback commands.  You don't get to have it both ways.

 regards, tom lane

Tom, you have a good point about transaction management, but I think
we _can_ have it both ways. There are several things that the author
of the foreign data wrapper can do to prevent bad things from being
done since he has ultimate control of everything that gets sent to the
foreign server. For many foreign servers it is enough to check that
the string being sent to the foreign server begins with select . Or
he can prevent pass-through queries when there is an on-going
transaction on the foreign server. Or the author of a particular
foreign data wrapper can prevent pass-through queries entirely.

The point is that this is only a concern for some kinds of foreign
servers and even then only for those foreign data wrappers that care
about transactions. If they don't implement update/insert/delete, for
example, then it doesn't matter. Since there are many other kinds of
foreign servers where this could be useful, it should be available, at
least as an option.

The reason I want it is to do use it for some of the things that David
Fetter was talking about --optimizing queries with aggregates and
GROUP BY. I have code that currently optimizes these sorts of queries
for a particular database engine. I did this several years ago before
there were foreign data wrappers so I had to roll my own using table
functions. My implementation query rewrite rather than plan
optimization (it seemed to me to be too hard to do in the panning
phase). See 
http://unobtainabol.blogspot.com/2013/04/daves-foreign-data-translating-foreign_24.html
for a description of what it does.

My plan was to generalize my current code to generic SQL databases and
to make it work with foreign data wrappers. If there is any interest
from the PG community I'll try to get my company to let me contribute
this back. But the first thing I need is to implement pass-through
queries for foreign servers or I have to duplicate all of the
functionality for managing foreign servers and tables.

Regards,
David Gudeman
http://unobtainabol.blogspot.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] pass-through queries to foreign servers

2013-07-30 Thread David Gudeman
When you write an application involving foreign tables, you frequently
end up with queries that are just too inefficient because they bring
too much data over from the foreign server. For a trivial example,
consider SELECT count(*) FROM t where t is a foreign table. This
will pull the entire table over the network just to count up the rows.
If the writer of the foreign data wrapper was clever enough, this may
only pull one column from the foreign server, but that can still be a
lot of data.

To solve (or work around) this problem, it would be convenient to have
a pass-through query mechanism associated with foreign servers. A
pass-through query would look like a table function, but would use the
name of the foreign server as the function name. For example:

CREATE SERVER foo ...;
CREATE USER MAPPING ...;
CREATE FOREIGN TABLE t (...) SERVER foo ... OPTIONS (table 't');

SELECT size FROM foo('SELECT count(*) FROM t') AS t(size BIGINT);

The SELECT above will execute the quoted string as a query on the
foreign server represented by foo. (Notice that only the CREATE SERVER
and CREATE USER MAPPING are needed for the SELECT to work. I just
added the CREATE FOREIGN TABLE for context.)

I can think of two ways to implement this. I think it would pretty
easy to just add a table function foo that does the right thing. This
would require the author of the foreign data wrapper to provide
another callback function to send the query and get back the results.
Such a callback function would largely duplicate the functionality of
the current callback functions and --because of the current
implementation of table functions-- it would materialize the entire
result set before returning it.

A more difficult solution (for me, at least) would be to construct a
sort of temporary foreign table from the pass-through query then let
it go through the usual foreign-table handling code. This also would
require some changes to foreign data wrappers. Current wrappers have
to construct a query to scan a foreign table but with a pass-through
query the query is already constructed. But this probably requires
less work for the authors of foreign data wrappers and it doesn't
materialize the results of the foreign query unnecessarily.

Any suggestions or hints?

Regards,
David Gudeman
http://unobtainabol.blogspot.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] minimizing the target list for foreign data wrappers

2013-04-23 Thread David Gudeman
In case anyone is interested, I tried it and it doesn't seem to work.
It looks like some other plan element already has the target-list
tuple baked. Now I'm trying to decide whether to give up on FDW. It's
a shame because it's such a sweet facility, but at this point, I just
don't think that it's mature enough for what I need to do.
Regards,
David Gudeman


On Mon, Apr 22, 2013 at 11:27 AM, David Gudeman dave.gude...@gmail.com wrote:
 Re-reading my first email I thought it was a little confusing, so here
 is some clarification. In GetForeignPlan, tlist seems to be a target
 list for a basic select * from the foreign table. For the ith
 TargetEntry te in tlist, it seems that te-expr is a var with
 varattno=i. I was mis-remembering and calling varattno attrno in the
 original email.

 My assumption is that the plan elements that use the output of the FDW
 plan node will access columns indirectly using tlist. In other words,
 I'm assuming that if there is a reference to a column c of the foreign
 table, this column will be represented as a Var with varattno being an
 offset into tlist. So if c is column number 3, for example, you get
 its value by looking up TargetEntry number 3 in tlist and evaluate the
 expr column for that TargetEntry. So if I change the  Var in the expr
 column so the varattno points to a different column in the output
 tuple, then everything will work.

 The two risky assumptions I'm making are 1. that it actually uses this
 indirect way of looking up columns in a foreign table and 2. that it
 actually uses the tlist that I pass in when I call make_foreignscan().

 Can anyone confirm or deny these assumptions?

 Thanks.

 On Sun, Apr 21, 2013 at 6:57 PM, David Gudeman dave.gude...@gmail.com wrote:

 A few years ago I wrote a roll-your-own foreign-data-wrapper system for 
 Postgres because Postgres didn't have one at the time (some details here 
 (http://unobtainabol.blogspot.com/2013/04/dave-foreign-data-introuction.html)
  if anyone is interested). Now I'm being tasked to move it to Postgres 9.2.x 
 and I'd like to use FDW if possible.

 One of the problems I'm having is that in my application, the foreign tables 
 typically have hundreds of columns while typical queries only access a dozen 
 or so (the foreign server is a columnar SQL database). Furthermore, there is 
 no size optimization for NULL values passed back from the foreign server, so 
 if I return all of the columns from the table --even as NULLs-- the returned 
 data size will be several times the size that it needs to be. My application 
 cannot tolerate this level of inefficiency, so I need to return minimal 
 columns from the foreign table.

 The documentation doesn't say how to do this, but looking at the code I 
 think it is possible. In GetForeignPlan() you have to pass on the tlist 
 argument, which I presume means that the query plan will use the tlist that 
 I pass in, right? If so, then it should be possible for me to write a 
 function that takes tlist and baserel-reltargetlist and return a version of 
 tlist that knows which foreign-table columns are actually used, and replaces 
 the rest with a NULL constant.

 For example, suppose the original tlist is this: [VAR(attrno=1), 
 VAR(attrno=2), VAR(attrno=3)] and reltarget list says that I only need args 
 1 and 3. Then the new tlist would look like this: [VAR(attrno=1), 
 CONST(val=NULL), VAR(attrno=2)] where the attrno of the last VAR has been 
 reduced by one because the 2 column is no longer there.

 I did something very much like this in my roll-your-own version of FDW so I 
 know basically how to do it, but I did it at the pre-planning stage and I'm 
 not sure how much is already packed into the other plan nodes at this point. 
 Maybe it's too late to change the target list?

 Can anyone give me some advice or warnings on this? I'd hate to go to the 
 trouble of implementing and testing it only to find that I'm making some 
 bogus assumptions.

 Thanks,
 David Gudeman



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] a patch for row-at-a-time execution for table functions

2013-04-23 Thread David Gudeman
The primary change we made to Postgres in order to support our own
version of foreign data wrappers was a row-at-a-time execution for
table functions. In standard Postgres, when you execute a table
function, it gathers all of the rows at once and stuffs them into a
buffer in order to support cursors, even if it is just a vanilla
forward scan. We modified the code so that when you do vanilla forward
scans it executes the function one row at a time. This wasn't a big
change since the support for executing functions that way was always
in there but it was bypassed in the plan execution code. Probably
someone always intended to do this, but never got around to it.

We also encountered what I think is a bug in the way that the cleanup
callback functions are called at the end of a table function. I'm not
sure I remember the details correctly, but I believe it was freeing
the memory region used for the private state before calling the
cleanup function. If that was the bug though, why didn't I just use
malloc for the private state? I'll have to review my notes on that
one...

Anyway, my company has agreed to let me post a patch for these
changes. Before I go to the work of getting the patches ready, I'd
like to take the temperature of the commiters and find out if there is
any likelihood of getting these patches accepted.

Anyone?


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] minimizing the target list for foreign data wrappers

2013-04-22 Thread David Gudeman
Re-reading my first email I thought it was a little confusing, so here
is some clarification. In GetForeignPlan, tlist seems to be a target
list for a basic select * from the foreign table. For the ith
TargetEntry te in tlist, it seems that te-expr is a var with
varattno=i. I was mis-remembering and calling varattno attrno in the
original email.

My assumption is that the plan elements that use the output of the FDW
plan node will access columns indirectly using tlist. In other words,
I'm assuming that if there is a reference to a column c of the foreign
table, this column will be represented as a Var with varattno being an
offset into tlist. So if c is column number 3, for example, you get
its value by looking up TargetEntry number 3 in tlist and evaluate the
expr column for that TargetEntry. So if I change the  Var in the expr
column so the varattno points to a different column in the output
tuple, then everything will work.

The two risky assumptions I'm making are 1. that it actually uses this
indirect way of looking up columns in a foreign table and 2. that it
actually uses the tlist that I pass in when I call make_foreignscan().

Can anyone confirm or deny these assumptions?

Thanks.

On Sun, Apr 21, 2013 at 6:57 PM, David Gudeman dave.gude...@gmail.com wrote:

 A few years ago I wrote a roll-your-own foreign-data-wrapper system for 
 Postgres because Postgres didn't have one at the time (some details here 
 (http://unobtainabol.blogspot.com/2013/04/dave-foreign-data-introuction.html) 
 if anyone is interested). Now I'm being tasked to move it to Postgres 9.2.x 
 and I'd like to use FDW if possible.

 One of the problems I'm having is that in my application, the foreign tables 
 typically have hundreds of columns while typical queries only access a dozen 
 or so (the foreign server is a columnar SQL database). Furthermore, there is 
 no size optimization for NULL values passed back from the foreign server, so 
 if I return all of the columns from the table --even as NULLs-- the returned 
 data size will be several times the size that it needs to be. My application 
 cannot tolerate this level of inefficiency, so I need to return minimal 
 columns from the foreign table.

 The documentation doesn't say how to do this, but looking at the code I think 
 it is possible. In GetForeignPlan() you have to pass on the tlist argument, 
 which I presume means that the query plan will use the tlist that I pass in, 
 right? If so, then it should be possible for me to write a function that 
 takes tlist and baserel-reltargetlist and return a version of tlist that 
 knows which foreign-table columns are actually used, and replaces the rest 
 with a NULL constant.

 For example, suppose the original tlist is this: [VAR(attrno=1), 
 VAR(attrno=2), VAR(attrno=3)] and reltarget list says that I only need args 1 
 and 3. Then the new tlist would look like this: [VAR(attrno=1), 
 CONST(val=NULL), VAR(attrno=2)] where the attrno of the last VAR has been 
 reduced by one because the 2 column is no longer there.

 I did something very much like this in my roll-your-own version of FDW so I 
 know basically how to do it, but I did it at the pre-planning stage and I'm 
 not sure how much is already packed into the other plan nodes at this point. 
 Maybe it's too late to change the target list?

 Can anyone give me some advice or warnings on this? I'd hate to go to the 
 trouble of implementing and testing it only to find that I'm making some 
 bogus assumptions.

 Thanks,
 David Gudeman



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] minimizing the target list for foreign data wrappers

2013-04-21 Thread David Gudeman
A few years ago I wrote a roll-your-own foreign-data-wrapper system for
Postgres because Postgres didn't have one at the time (some details
herehttp://unubtainabol.blogspot.com/2013/04/daves-foreign-data-introuction.htmlif
anyone is interested). Now I'm being tasked to move it to Postgres
9.2.x
and I'd like to use FDW if possible.

One of the problems I'm having is that in my application, the foreign
tables typically have hundreds of columns while typical queries only access
a dozen or so (the foreign server is a columnar SQL database). Furthermore,
there is no size optimization for NULL values passed back from the foreign
server, so if I return all of the columns from the table --even as NULLs--
the returned data size will be several times the size that it needs to be.
My application cannot tolerate this level of inefficiency, so I need to
return minimal columns from the foreign table.

The documentation doesn't say how to do this, but looking at the code I
think it is possible. In GetForeignPlan() you have to pass on the tlist
argument, which I presume means that the query plan will use the tlist that
I pass in, right? If so, then it should be possible for me to write a
function that takes tlist and baserel-reltargetlist and return a version
of tlist that knows which foreign-table columns are actually used, and
replaces the rest with a NULL constant.

For example, suppose the original tlist is this: [VAR(attrno=1),
VAR(attrno=2), VAR(attrno=3)] and reltarget list says that I only need args
1 and 3. Then the new tlist would look like this: [VAR(attrno=1),
CONST(val=NULL), VAR(attrno=2)] where the attrno of the last VAR has been
reduced by one because the 2 column is no longer there.

I did something very much like this in my roll-your-own version of FDW so I
know basically how to do it, but I did it at the pre-planning stage and I'm
not sure how much is already packed into the other plan nodes at this
point. Maybe it's too late to change the target list?

Can anyone give me some advice or warnings on this? I'd hate to go to the
trouble of implementing and testing it only to find that I'm making some
bogus assumptions.

Thanks,
David Gudeman