Re: [HACKERS] Proposal: Create index on foreign table

2012-03-22 Thread Etsuro Fujita
(2012/03/22 9:24), Tom Lane wrote:
 What's at stake in the current discussion is whether it would be
 advantageous to an FDW if we were to store some information about
 remote indexes in the local catalogs.  It would still be the FDW's
 responsibility, and nobody else's, to make use of that information.
 I can believe that we might eventually decide to do that; but I do not
 think we have enough experience with different sorts of FDWs to define
 a good solution today.  And I think that most likely a good solution
 will *not* conflate such remote-index information with local indexes.
 
 So basically my reaction to Etsuro-san's proposal is this is
 premature.  I think he should be hacking together some FDW-private
 facilities for individual FDWs instead (with the full understanding
 that these might be throwaway prototypes), and then looking for a
 common abstraction after he's done a few of those.

OK.  I'd like to at first focus on file FDW and Postgres FDW.  I'd like
to thank everyone who commented on this topic.  Thanks!

Best regards,
Etsuro Fujita

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Create index on foreign table

2012-03-21 Thread Etsuro Fujita

(2012/03/21 4:39), Robert Haas wrote:

On Fri, Mar 16, 2012 at 4:44 AM, Etsuro Fujita
fujita.ets...@lab.ntt.co.jp  wrote:



For a flat file, CREATE INDEX constructs an index in the same way as an
index for a regular table.


It seems really weird to have the data half inside the database and
half outside of it like this.  What is the use case for this feature?
I would have thought that something like file_fdw would be primarily
targeted toward use cases where you intend to read the data file only
a few times, or you always intend to sequential scan it.  If you need
to index it, why not store the data in a regular table?  That's what
they are for.  What you're proposing sounds to me like a lot of work
for no real gain.


This feature is planned to used to save time and space for loading file 
data into Postgres.  As you know, loading data is time-consuming.  In 
addition, it requires twice the disk space in the case where source 
files have to be stored against the time of need.  I think this feature 
is especially useful for data warehouse environments.



On the other hand, for a remote table,
CREATE INDEX collects information about the index on the specified
column(s) for the specified table that was created on the remote table.


I can possibly see the point of this.  Having local information about
which remote indexes are available seems like it could be useful.  As
Heikki says, you could cache it on a per-session basis, but that might
not be very efficient.

I also think that it would not be a very good idea to have CREATE
INDEX on a foreign table sometimes really create an index and other
times just define the properties of a remote index.  If we're going to
have both features at all, I think they should use different syntax.
I suggest that CREATE INDEX or CREATE FOREIGN INDEX is the right way
to provide the query planner with information about remote-side
indexes; and that if we even want to have indexes on flat files, the
interface to those should be exported via functions bundled in the
file_fdw extension rather than DDL.  There's no reason to suppose that
the indexes the FDW supports correspond to PostgreSQL's AMs, so tying
it into that framework doesn't seem wise.


I did an investigation on DB2 a little bit.  DB2 uses the CREATE INDEX 
SPECIFICATION ONLY statement to define the properties of a remote index.


CREATE INDEX index_name ON foreintable_name
(column_name) SPECIFICATION ONLY

How about introducing this kind of option?; Using the CREATE INDEX 
statement with the SPECIFICATION ONLY option, a user can just define the 
properties of a remote index.  On the other hand, using the statement 
without this option, he or she can specify more options like the USING 
option and really create an index, which requires that the FDW's AMs 
correspond to Postgres's AMs, as pointed out by you.  If the real index 
of an external data is considered as just a complementary data for 
efficient query processing like stats to be collected for the external 
data by the ANALYZE statement, it doen't seem so weird to use the DDL 
for the external data, create the real index for it, and store the index 
data inside Postgres.


Best regards,
Etsuro Fujita

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Create index on foreign table

2012-03-21 Thread Robert Haas
On Wed, Mar 21, 2012 at 4:47 AM, Etsuro Fujita
fujita.ets...@lab.ntt.co.jp wrote:
 I did an investigation on DB2 a little bit.  DB2 uses the CREATE INDEX
 SPECIFICATION ONLY statement to define the properties of a remote index.

    CREATE INDEX index_name ON foreintable_name
    (column_name) SPECIFICATION ONLY

 How about introducing this kind of option?; Using the CREATE INDEX statement
 with the SPECIFICATION ONLY option, a user can just define the properties of
 a remote index.  On the other hand, using the statement without this option,
 he or she can specify more options like the USING option and really create
 an index, which requires that the FDW's AMs correspond to Postgres's AMs, as
 pointed out by you.  If the real index of an external data is considered as
 just a complementary data for efficient query processing like stats to be
 collected for the external data by the ANALYZE statement, it doen't seem so
 weird to use the DDL for the external data, create the real index for it,
 and store the index data inside Postgres.

I still don't think it's a good idea to introduce the concept of a
PostgreSQL index that indexes data not stored in the database.  There
is some pretty serious impedance mismatch there.  PostgreSQL indexes
are intended to store CTIDs; you might be able to hack things for
file_fdw to make a byte offset look like a CTID, but in general I
don't think you can count on making that work.  There's no guarantee
that a particular FDW provides unique identifiers for every data
element that fit in six bytes and allow for fast retrieval.  In fact,
beyond flat files, I suspect that's more the exception than the norm.
I agree with you that our bulk loading isn't fast enough (or
space-efficient enough) but I don't think the right solution is to
contort our index code, which is not designed to do this and probably
won't handle it very gracefully.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Create index on foreign table

2012-03-21 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Wed, Mar 21, 2012 at 4:47 AM, Etsuro Fujita
 fujita.ets...@lab.ntt.co.jp wrote:
 I did an investigation on DB2 a little bit.  DB2 uses the CREATE INDEX
 SPECIFICATION ONLY statement to define the properties of a remote index.
 
CREATE INDEX index_name ON foreintable_name
(column_name) SPECIFICATION ONLY
 
 How about introducing this kind of option?

 I still don't think it's a good idea to introduce the concept of a
 PostgreSQL index that indexes data not stored in the database.

Well, above Etsuro-san is proposing the other case, ie a Postgres index
definition for an index *not* stored in the database.  But frankly
I think both ideas are pretty bad.  There's no reason to expect that
Postgres' model of an index will accurately describe the facilities
available in a remote server; and if it's not accurate, is it really
of any use for planning?

There might be reason to store information about remote indexes locally
(although nobody has yet proven that session-local caching isn't
sufficient).  But there is very little reason to suppose that that
information will look the same for all different types of remote
servers, and I don't believe we know enough yet to design a common
representation.  I think it'd be wise to let the different FDWs solve
this problem on their own for a couple of years, and then we can look
at whether they have common requirements or not.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Create index on foreign table

2012-03-21 Thread Greg Stark
On Wed, Mar 21, 2012 at 2:22 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Well, above Etsuro-san is proposing the other case, ie a Postgres index
 definition for an index *not* stored in the database.  But frankly
 I think both ideas are pretty bad.  There's no reason to expect that
 Postgres' model of an index will accurately describe the facilities
 available in a remote server; and if it's not accurate, is it really
 of any use for planning?

Surely this will ultimately be necessary though? Some of the other use
cases for fdw are things like federated databases spread over multiple
postgres databases or even other SQL databases. They won't be very
practical if they can only implement sequential scans.

-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Create index on foreign table

2012-03-21 Thread Tom Lane
Greg Stark st...@mit.edu writes:
 On Wed, Mar 21, 2012 at 2:22 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Well, above Etsuro-san is proposing the other case, ie a Postgres index
 definition for an index *not* stored in the database.  But frankly
 I think both ideas are pretty bad.  There's no reason to expect that
 Postgres' model of an index will accurately describe the facilities
 available in a remote server; and if it's not accurate, is it really
 of any use for planning?

 Surely this will ultimately be necessary though?

At the moment, whether a foreign table has indexes and when to use them
is strictly a private matter between the FDW and the remote server (if
any).  The rest of the system doesn't have any need-to-know whatsoever,
and I don't foresee that it ever will, at least not without drastic
redesign of the FDW API.

What's at stake in the current discussion is whether it would be
advantageous to an FDW if we were to store some information about
remote indexes in the local catalogs.  It would still be the FDW's
responsibility, and nobody else's, to make use of that information.
I can believe that we might eventually decide to do that; but I do not
think we have enough experience with different sorts of FDWs to define
a good solution today.  And I think that most likely a good solution
will *not* conflate such remote-index information with local indexes.

So basically my reaction to Etsuro-san's proposal is this is
premature.  I think he should be hacking together some FDW-private
facilities for individual FDWs instead (with the full understanding
that these might be throwaway prototypes), and then looking for a
common abstraction after he's done a few of those.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Create index on foreign table

2012-03-20 Thread Robert Haas
On Fri, Mar 16, 2012 at 4:44 AM, Etsuro Fujita
fujita.ets...@lab.ntt.co.jp wrote:
 I have a plan to support 'Create index on foreign table' for 9.3.  Here
 is my plan.

 The index creation is supported for a flat file such as CSV and a remote
 table on a RDB e.g., Postgres using CREATE INDEX.  (I thought using a
 new statement, CREATE FOREIGN INDEX, at first, but I think that CREATE
 INDEX would be sufficient to define an index for the foreign table.)
 For a flat file, CREATE INDEX constructs an index in the same way as an
 index for a regular table.

It seems really weird to have the data half inside the database and
half outside of it like this.  What is the use case for this feature?
I would have thought that something like file_fdw would be primarily
targeted toward use cases where you intend to read the data file only
a few times, or you always intend to sequential scan it.  If you need
to index it, why not store the data in a regular table?  That's what
they are for.  What you're proposing sounds to me like a lot of work
for no real gain.

 On the other hand, for a remote table,
 CREATE INDEX collects information about the index on the specified
 column(s) for the specified table that was created on the remote table.

I can possibly see the point of this.  Having local information about
which remote indexes are available seems like it could be useful.  As
Heikki says, you could cache it on a per-session basis, but that might
not be very efficient.

I also think that it would not be a very good idea to have CREATE
INDEX on a foreign table sometimes really create an index and other
times just define the properties of a remote index.  If we're going to
have both features at all, I think they should use different syntax.
I suggest that CREATE INDEX or CREATE FOREIGN INDEX is the right way
to provide the query planner with information about remote-side
indexes; and that if we even want to have indexes on flat files, the
interface to those should be exported via functions bundled in the
file_fdw extension rather than DDL.  There's no reason to suppose that
the indexes the FDW supports correspond to PostgreSQL's AMs, so tying
it into that framework doesn't seem wise.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Create index on foreign table

2012-03-18 Thread Etsuro Fujita

(2012/03/16 22:51), Shigeru Hanada wrote:

2012/3/16 Etsuro Fujitafujita.ets...@lab.ntt.co.jp:



The index creation is supported for a flat file such as CSV and a remote
table on a RDB e.g., Postgres using CREATE INDEX.



IMHO CREATE INDEX for foreign tables should have general design,
not specific to some kind of FDWs.


OK.  I'll try to design CREATE INDEX more generally, though I'll at 
first focus on those two.



I'd like to build the index physical data file for a flat file using the
index access method of regular tables (ie btree, hash, gin, gist, and
spgist) based on the following transformation between the TID and the
file offset to some data in the file:

block_number  = file_offset_to_some_data / BLCKSZ
offset_number = file_offset_to_some_data % BLCKSZ


Indeed these information would help searching data stored in local
files.  But, again, it seems too specific to file-based FDWs.  I'd
suggest separating basic general design and implementation by FDWs.
The design you shown here seems indexed-file_fdw to me...


This transformation is for a flat file.  I think an FDW author who wants 
to build the index physical data file for an external data may choose 
any transformation that defines a one-to-one mapping into the TID space.


Best regards,
Etsuro Fujita

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Create index on foreign table

2012-03-18 Thread Etsuro Fujita

(2012/03/17 2:07), David Fetter wrote:

On Fri, Mar 16, 2012 at 11:58:29AM +0200, Heikki Linnakangas wrote:

On 16.03.2012 10:44, Etsuro Fujita wrote:



For a flat file, CREATE INDEX constructs
an index in the same way as an index for a regular table.



For starters, how would you keep the index up-to-date when the flat
file is modified?



One way is to poll the remote source for evidence of such changes
during auto_vacuum or with similar daemon processes.  Another is by
waiting for a signal from an external source such as a NOTIFY.  Which
is more appropriate will again depend on circumstances.


I think that's a good idea.  I'd like to work on it in the future.

Best regards,
Etsuro Fujita

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Create index on foreign table

2012-03-16 Thread Heikki Linnakangas
On 16.03.2012 10:44, Etsuro Fujita wrote:
 I have a plan to support 'Create index on foreign table' for 9.3.  Here
 is my plan.
 
 The index creation is supported for a flat file such as CSV and a remote
 table on a RDB e.g., Postgres using CREATE INDEX.  (I thought using a
 new statement, CREATE FOREIGN INDEX, at first, but I think that CREATE
 INDEX would be sufficient to define an index for the foreign table.)
 For a flat file, CREATE INDEX constructs an index in the same way as an
 index for a regular table.

I think this belongs completely in the remote data source. If you want
to index flat files, create an extra file for it or something, and
enhance the wrapper so that it can take advantage of it. Keeping the
index inside the database while the data is somewhere else creates a
whole new class of problems. For starters, how would you keep the index
up-to-date when the flat file is modified? If you want to take advantage
of PostgreSQL's indexing, you'll just have to just load the data into a
regular table.

  On the other hand, for a remote table,
 CREATE INDEX collects information about the index on the specified
 column(s) for the specified table that was created on the remote table.

I don't see the point of this either. The planner asks the FDW for cost
estimates, and if the FDW knows about indexes in the remote server, it
can certainly adjust the estimates accordingly. But that's all internal
to the FDW. It might want delegate the whole cost estimation to the
remote server by running EXPLAIN there, or it could use its knowledge of
indexes that exist there, but I don't see why the rest of the system
would need to know what indexes there are in the remote system. If the
FDW needs that information, it can query the remote server for it on
first access, and cache the information for the lifetime of the session.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Create index on foreign table

2012-03-16 Thread Etsuro Fujita
(2012/03/16 18:58), Heikki Linnakangas wrote:
 On 16.03.2012 10:44, Etsuro Fujita wrote:
 I have a plan to support 'Create index on foreign table' for 9.3.  Here
 is my plan.

 The index creation is supported for a flat file such as CSV and a remote
 table on a RDB e.g., Postgres using CREATE INDEX.  (I thought using a
 new statement, CREATE FOREIGN INDEX, at first, but I think that CREATE
 INDEX would be sufficient to define an index for the foreign table.)
 For a flat file, CREATE INDEX constructs an index in the same way as an
 index for a regular table.
 
 I think this belongs completely in the remote data source. If you want
 to index flat files, create an extra file for it or something, and
 enhance the wrapper so that it can take advantage of it. Keeping the
 index inside the database while the data is somewhere else creates a
 whole new class of problems. For starters, how would you keep the index
 up-to-date when the flat file is modified?

Index update is outside the scope at least until foreign table update is
supported.  It is required for the user to do DROP INDEX and then do
CREATE INDEX again when the file has been modified.

I plan to implement the GetForeignPaths callback routine of file_fdw to
just give up creating index paths if the file's checksum or timestamp or
something has changed.  I think the index of the file is something
similar to the stats of it in a certain sense.

   On the other hand, for a remote table,
 CREATE INDEX collects information about the index on the specified
 column(s) for the specified table that was created on the remote table.
 
 I don't see the point of this either. The planner asks the FDW for cost
 estimates, and if the FDW knows about indexes in the remote server, it
 can certainly adjust the estimates accordingly. But that's all internal
 to the FDW. It might want delegate the whole cost estimation to the
 remote server by running EXPLAIN there, or it could use its knowledge of
 indexes that exist there, but I don't see why the rest of the system
 would need to know what indexes there are in the remote system.

In the case of joining A on the local system and B on the remote system,
for example, it is required for the local system to know what indexes
there are on B in the remote system in order to consider index paths
parametrized by A for nestloop-with-inner-parametrized-scan paths.

 If the
 FDW needs that information, it can query the remote server for it on
 first access, and cache the information for the lifetime of the session.

I think that is one of the choices.  However, it seems convenient to me
to utilize the existing framework for index information because this
approach has the possibility for the FDWs to share e.g., the path
creation processing with Postgres core to some extent to create index
paths, of course which requires to refactor existing query optimization
code.

Best regards,
Etsuro Fujita

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Create index on foreign table

2012-03-16 Thread Shigeru Hanada
2012/3/16 Etsuro Fujita fujita.ets...@lab.ntt.co.jp:
 I have a plan to support 'Create index on foreign table' for 9.3.  Here
 is my plan.

Very interesting idea, but...

 The index creation is supported for a flat file such as CSV and a remote
 table on a RDB e.g., Postgres using CREATE INDEX.

Why do you limit the target type to those two?  How about web
services and non-relational databases?  Some web services would
provide id-to-content mapping, and KVSs are obviously accessible by
key.  IMHO CREATE INDEX for foreign tables should have general design,
not specific to some kind of FDWs.

 I'd like to build the index physical data file for a flat file using the
 index access method of regular tables (ie btree, hash, gin, gist, and
 spgist) based on the following transformation between the TID and the
 file offset to some data in the file:

    block_number  = file_offset_to_some_data / BLCKSZ
    offset_number = file_offset_to_some_data % BLCKSZ

Indeed these information would help searching data stored in local
files.  But, again, it seems too specific to file-based FDWs.  I'd
suggest separating basic general design and implementation by FDWs.
The design you shown here seems indexed-file_fdw to me...

Regards,
-- 
Shigeru HANADA

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Create index on foreign table

2012-03-16 Thread David Fetter
On Fri, Mar 16, 2012 at 11:58:29AM +0200, Heikki Linnakangas wrote:
 On 16.03.2012 10:44, Etsuro Fujita wrote:
  I have a plan to support 'Create index on foreign table' for 9.3.
  Here is my plan.
  
  The index creation is supported for a flat file such as CSV and a

As others, I don't see a reason to restrict this to some particular
type of FDW.

  remote table on a RDB e.g., Postgres using CREATE INDEX.  (I
  thought using a new statement, CREATE FOREIGN INDEX, at first, but
  I think that CREATE INDEX would be sufficient to define an index
  for the foreign table.) For a flat file, CREATE INDEX constructs
  an index in the same way as an index for a regular table.
 
 I think this belongs completely in the remote data source.

I think this needs to be decided on a case-by-case basis instead.

 If you want to index flat files, create an extra file for it or
 something, and enhance the wrapper so that it can take advantage of
 it. Keeping the index inside the database while the data is
 somewhere else creates a whole new class of problems.

How?  These aren't super different from those for, say, unlogged
tables.

 For starters, how would you keep the index up-to-date when the flat
 file is modified?

One way is to poll the remote source for evidence of such changes
during auto_vacuum or with similar daemon processes.  Another is by
waiting for a signal from an external source such as a NOTIFY.  Which
is more appropriate will again depend on circumstances.

 If you want to take advantage of PostgreSQL's indexing, you'll just
 have to just load the data into a regular table.

I disagree.  Indexing in general allows you to store only log-N index
rows for each N rows in an external table, which could be a very big
win.  Deciding in advance for everyone that this is not worthwhile is
not in our purview.

  On the other hand, for a remote table, CREATE INDEX collects
  information about the index on the specified column(s) for the
  specified table that was created on the remote table.
 
 I don't see the point of this either. The planner asks the FDW for
 cost estimates, and if the FDW knows about indexes in the remote
 server, it can certainly adjust the estimates accordingly. But
 that's all internal to the FDW. It might want delegate the whole
 cost estimation to the remote server by running EXPLAIN there, or it
 could use its knowledge of indexes that exist there, but I don't see
 why the rest of the system would need to know what indexes there are
 in the remote system.

Good point, for the remote index case, which I contend is not every
one :)

 If the FDW needs that information, it can query the remote server
 for it on first access, and cache the information for the lifetime
 of the session.

Of course, a mere few GB of information queried each time couldn't
possibly cause intolerable overheads...

Cheers,
David.
-- 
David Fetter da...@fetter.org http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers