Re: [HACKERS] Proposal: Create index on foreign table
(2012/03/22 9:24), Tom Lane wrote: > What's at stake in the current discussion is whether it would be > advantageous to an FDW if we were to store some information about > remote indexes in the local catalogs. It would still be the FDW's > responsibility, and nobody else's, to make use of that information. > I can believe that we might eventually decide to do that; but I do not > think we have enough experience with different sorts of FDWs to define > a good solution today. And I think that most likely a good solution > will *not* conflate such remote-index information with local indexes. > > So basically my reaction to Etsuro-san's proposal is "this is > premature". I think he should be hacking together some FDW-private > facilities for individual FDWs instead (with the full understanding > that these might be throwaway prototypes), and then looking for a > common abstraction after he's done a few of those. OK. I'd like to at first focus on file FDW and Postgres FDW. I'd like to thank everyone who commented on this topic. Thanks! Best regards, Etsuro Fujita -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: Create index on foreign table
Greg Stark writes: > On Wed, Mar 21, 2012 at 2:22 PM, Tom Lane wrote: >> Well, above Etsuro-san is proposing the other case, ie a Postgres index >> definition for an index *not* stored in the database. But frankly >> I think both ideas are pretty bad. There's no reason to expect that >> Postgres' model of an index will accurately describe the facilities >> available in a remote server; and if it's not accurate, is it really >> of any use for planning? > Surely this will ultimately be necessary though? At the moment, whether a foreign table has indexes and when to use them is strictly a private matter between the FDW and the remote server (if any). The rest of the system doesn't have any need-to-know whatsoever, and I don't foresee that it ever will, at least not without drastic redesign of the FDW API. What's at stake in the current discussion is whether it would be advantageous to an FDW if we were to store some information about remote indexes in the local catalogs. It would still be the FDW's responsibility, and nobody else's, to make use of that information. I can believe that we might eventually decide to do that; but I do not think we have enough experience with different sorts of FDWs to define a good solution today. And I think that most likely a good solution will *not* conflate such remote-index information with local indexes. So basically my reaction to Etsuro-san's proposal is "this is premature". I think he should be hacking together some FDW-private facilities for individual FDWs instead (with the full understanding that these might be throwaway prototypes), and then looking for a common abstraction after he's done a few of those. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: Create index on foreign table
On Wed, Mar 21, 2012 at 2:22 PM, Tom Lane wrote: > Well, above Etsuro-san is proposing the other case, ie a Postgres index > definition for an index *not* stored in the database. But frankly > I think both ideas are pretty bad. There's no reason to expect that > Postgres' model of an index will accurately describe the facilities > available in a remote server; and if it's not accurate, is it really > of any use for planning? Surely this will ultimately be necessary though? Some of the other use cases for fdw are things like federated databases spread over multiple postgres databases or even other SQL databases. They won't be very practical if they can only implement sequential scans. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: Create index on foreign table
Robert Haas writes: > On Wed, Mar 21, 2012 at 4:47 AM, Etsuro Fujita > wrote: >> I did an investigation on DB2 a little bit. DB2 uses the CREATE INDEX >> SPECIFICATION ONLY statement to define the properties of a remote index. >> >>CREATE INDEX index_name ON foreintable_name >>(column_name) SPECIFICATION ONLY >> >> How about introducing this kind of option? > I still don't think it's a good idea to introduce the concept of a > PostgreSQL index that indexes data not stored in the database. Well, above Etsuro-san is proposing the other case, ie a Postgres index definition for an index *not* stored in the database. But frankly I think both ideas are pretty bad. There's no reason to expect that Postgres' model of an index will accurately describe the facilities available in a remote server; and if it's not accurate, is it really of any use for planning? There might be reason to store information about remote indexes locally (although nobody has yet proven that session-local caching isn't sufficient). But there is very little reason to suppose that that information will look the same for all different types of remote servers, and I don't believe we know enough yet to design a common representation. I think it'd be wise to let the different FDWs solve this problem on their own for a couple of years, and then we can look at whether they have common requirements or not. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: Create index on foreign table
On Wed, Mar 21, 2012 at 4:47 AM, Etsuro Fujita wrote: > I did an investigation on DB2 a little bit. DB2 uses the CREATE INDEX > SPECIFICATION ONLY statement to define the properties of a remote index. > > CREATE INDEX index_name ON foreintable_name > (column_name) SPECIFICATION ONLY > > How about introducing this kind of option?; Using the CREATE INDEX statement > with the SPECIFICATION ONLY option, a user can just define the properties of > a remote index. On the other hand, using the statement without this option, > he or she can specify more options like the USING option and really create > an index, which requires that the FDW's AMs correspond to Postgres's AMs, as > pointed out by you. If the real index of an external data is considered as > just a complementary data for efficient query processing like stats to be > collected for the external data by the ANALYZE statement, it doen't seem so > weird to use the DDL for the external data, create the real index for it, > and store the index data inside Postgres. I still don't think it's a good idea to introduce the concept of a PostgreSQL index that indexes data not stored in the database. There is some pretty serious impedance mismatch there. PostgreSQL indexes are intended to store CTIDs; you might be able to hack things for file_fdw to make a byte offset look like a CTID, but in general I don't think you can count on making that work. There's no guarantee that a particular FDW provides unique identifiers for every data element that fit in six bytes and allow for fast retrieval. In fact, beyond flat files, I suspect that's more the exception than the norm. I agree with you that our bulk loading isn't fast enough (or space-efficient enough) but I don't think the right solution is to contort our index code, which is not designed to do this and probably won't handle it very gracefully. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: Create index on foreign table
(2012/03/21 4:39), Robert Haas wrote: On Fri, Mar 16, 2012 at 4:44 AM, Etsuro Fujita wrote: For a flat file, CREATE INDEX constructs an index in the same way as an index for a regular table. It seems really weird to have the data half inside the database and half outside of it like this. What is the use case for this feature? I would have thought that something like file_fdw would be primarily targeted toward use cases where you intend to read the data file only a few times, or you always intend to sequential scan it. If you need to index it, why not store the data in a regular table? That's what they are for. What you're proposing sounds to me like a lot of work for no real gain. This feature is planned to used to save time and space for loading file data into Postgres. As you know, loading data is time-consuming. In addition, it requires twice the disk space in the case where source files have to be stored against the time of need. I think this feature is especially useful for data warehouse environments. On the other hand, for a remote table, CREATE INDEX collects information about the index on the specified column(s) for the specified table that was created on the remote table. I can possibly see the point of this. Having local information about which remote indexes are available seems like it could be useful. As Heikki says, you could cache it on a per-session basis, but that might not be very efficient. I also think that it would not be a very good idea to have CREATE INDEX on a foreign table sometimes really create an index and other times just define the properties of a remote index. If we're going to have both features at all, I think they should use different syntax. I suggest that CREATE INDEX or CREATE FOREIGN INDEX is the right way to provide the query planner with information about remote-side indexes; and that if we even want to have indexes on flat files, the interface to those should be exported via functions bundled in the file_fdw extension rather than DDL. There's no reason to suppose that the indexes the FDW supports correspond to PostgreSQL's AMs, so tying it into that framework doesn't seem wise. I did an investigation on DB2 a little bit. DB2 uses the CREATE INDEX SPECIFICATION ONLY statement to define the properties of a remote index. CREATE INDEX index_name ON foreintable_name (column_name) SPECIFICATION ONLY How about introducing this kind of option?; Using the CREATE INDEX statement with the SPECIFICATION ONLY option, a user can just define the properties of a remote index. On the other hand, using the statement without this option, he or she can specify more options like the USING option and really create an index, which requires that the FDW's AMs correspond to Postgres's AMs, as pointed out by you. If the real index of an external data is considered as just a complementary data for efficient query processing like stats to be collected for the external data by the ANALYZE statement, it doen't seem so weird to use the DDL for the external data, create the real index for it, and store the index data inside Postgres. Best regards, Etsuro Fujita -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: Create index on foreign table
On Fri, Mar 16, 2012 at 4:44 AM, Etsuro Fujita wrote: > I have a plan to support 'Create index on foreign table' for 9.3. Here > is my plan. > > The index creation is supported for a flat file such as CSV and a remote > table on a RDB e.g., Postgres using CREATE INDEX. (I thought using a > new statement, CREATE FOREIGN INDEX, at first, but I think that CREATE > INDEX would be sufficient to define an index for the foreign table.) > For a flat file, CREATE INDEX constructs an index in the same way as an > index for a regular table. It seems really weird to have the data half inside the database and half outside of it like this. What is the use case for this feature? I would have thought that something like file_fdw would be primarily targeted toward use cases where you intend to read the data file only a few times, or you always intend to sequential scan it. If you need to index it, why not store the data in a regular table? That's what they are for. What you're proposing sounds to me like a lot of work for no real gain. > On the other hand, for a remote table, > CREATE INDEX collects information about the index on the specified > column(s) for the specified table that was created on the remote table. I can possibly see the point of this. Having local information about which remote indexes are available seems like it could be useful. As Heikki says, you could cache it on a per-session basis, but that might not be very efficient. I also think that it would not be a very good idea to have CREATE INDEX on a foreign table sometimes really create an index and other times just define the properties of a remote index. If we're going to have both features at all, I think they should use different syntax. I suggest that CREATE INDEX or CREATE FOREIGN INDEX is the right way to provide the query planner with information about remote-side indexes; and that if we even want to have indexes on flat files, the interface to those should be exported via functions bundled in the file_fdw extension rather than DDL. There's no reason to suppose that the indexes the FDW supports correspond to PostgreSQL's AMs, so tying it into that framework doesn't seem wise. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: Create index on foreign table
(2012/03/17 2:07), David Fetter wrote: On Fri, Mar 16, 2012 at 11:58:29AM +0200, Heikki Linnakangas wrote: On 16.03.2012 10:44, Etsuro Fujita wrote: For a flat file, CREATE INDEX constructs an index in the same way as an index for a regular table. For starters, how would you keep the index up-to-date when the flat file is modified? One way is to poll the remote source for evidence of such changes during auto_vacuum or with similar daemon processes. Another is by waiting for a signal from an external source such as a NOTIFY. Which is more appropriate will again depend on circumstances. I think that's a good idea. I'd like to work on it in the future. Best regards, Etsuro Fujita -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: Create index on foreign table
(2012/03/16 22:51), Shigeru Hanada wrote: 2012/3/16 Etsuro Fujita: The index creation is supported for a flat file such as CSV and a remote table on a RDB e.g., Postgres using CREATE INDEX. IMHO CREATE INDEX for foreign tables should have general design, not specific to some kind of FDWs. OK. I'll try to design CREATE INDEX more generally, though I'll at first focus on those two. I'd like to build the index physical data file for a flat file using the index access method of regular tables (ie btree, hash, gin, gist, and spgist) based on the following transformation between the TID and the file offset to some data in the file: block_number = file_offset_to_some_data / BLCKSZ offset_number = file_offset_to_some_data % BLCKSZ Indeed these information would help searching data stored in local files. But, again, it seems too specific to file-based FDWs. I'd suggest separating basic general design and implementation by FDWs. The design you shown here seems indexed-file_fdw to me... This transformation is for a flat file. I think an FDW author who wants to build the index physical data file for an external data may choose any transformation that defines a one-to-one mapping into the TID space. Best regards, Etsuro Fujita -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: Create index on foreign table
On Fri, Mar 16, 2012 at 11:58:29AM +0200, Heikki Linnakangas wrote: > On 16.03.2012 10:44, Etsuro Fujita wrote: > > I have a plan to support 'Create index on foreign table' for 9.3. > > Here is my plan. > > > > The index creation is supported for a flat file such as CSV and a As others, I don't see a reason to restrict this to some particular type of FDW. > > remote table on a RDB e.g., Postgres using CREATE INDEX. (I > > thought using a new statement, CREATE FOREIGN INDEX, at first, but > > I think that CREATE INDEX would be sufficient to define an index > > for the foreign table.) For a flat file, CREATE INDEX constructs > > an index in the same way as an index for a regular table. > > I think this belongs completely in the remote data source. I think this needs to be decided on a case-by-case basis instead. > If you want to index flat files, create an extra file for it or > something, and enhance the wrapper so that it can take advantage of > it. Keeping the index inside the database while the data is > somewhere else creates a whole new class of problems. How? These aren't super different from those for, say, unlogged tables. > For starters, how would you keep the index up-to-date when the flat > file is modified? One way is to poll the remote source for evidence of such changes during auto_vacuum or with similar daemon processes. Another is by waiting for a signal from an external source such as a NOTIFY. Which is more appropriate will again depend on circumstances. > If you want to take advantage of PostgreSQL's indexing, you'll just > have to just load the data into a regular table. I disagree. Indexing in general allows you to store only log-N index rows for each N rows in an external table, which could be a very big win. Deciding in advance for everyone that this is not worthwhile is not in our purview. > > On the other hand, for a remote table, CREATE INDEX collects > > information about the index on the specified column(s) for the > > specified table that was created on the remote table. > > I don't see the point of this either. The planner asks the FDW for > cost estimates, and if the FDW knows about indexes in the remote > server, it can certainly adjust the estimates accordingly. But > that's all internal to the FDW. It might want delegate the whole > cost estimation to the remote server by running EXPLAIN there, or it > could use its knowledge of indexes that exist there, but I don't see > why the rest of the system would need to know what indexes there are > in the remote system. Good point, for the remote index case, which I contend is not every one :) > If the FDW needs that information, it can query the remote server > for it on first access, and cache the information for the lifetime > of the session. Of course, a mere few GB of information queried each time couldn't possibly cause intolerable overheads... Cheers, David. -- David Fetter http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fet...@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: Create index on foreign table
2012/3/16 Etsuro Fujita : > I have a plan to support 'Create index on foreign table' for 9.3. Here > is my plan. Very interesting idea, but... > The index creation is supported for a flat file such as CSV and a remote > table on a RDB e.g., Postgres using CREATE INDEX. Why do you limit the target type to those two? How about web services and non-relational databases? Some web services would provide id-to-content mapping, and KVSs are obviously accessible by key. IMHO CREATE INDEX for foreign tables should have general design, not specific to some kind of FDWs. > I'd like to build the index physical data file for a flat file using the > index access method of regular tables (ie btree, hash, gin, gist, and > spgist) based on the following transformation between the TID and the > file offset to some data in the file: > > block_number = file_offset_to_some_data / BLCKSZ > offset_number = file_offset_to_some_data % BLCKSZ Indeed these information would help searching data stored in local files. But, again, it seems too specific to file-based FDWs. I'd suggest separating basic general design and implementation by FDWs. The design you shown here seems indexed-file_fdw to me... Regards, -- Shigeru HANADA -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: Create index on foreign table
(2012/03/16 18:58), Heikki Linnakangas wrote: > On 16.03.2012 10:44, Etsuro Fujita wrote: >> I have a plan to support 'Create index on foreign table' for 9.3. Here >> is my plan. >> >> The index creation is supported for a flat file such as CSV and a remote >> table on a RDB e.g., Postgres using CREATE INDEX. (I thought using a >> new statement, CREATE FOREIGN INDEX, at first, but I think that CREATE >> INDEX would be sufficient to define an index for the foreign table.) >> For a flat file, CREATE INDEX constructs an index in the same way as an >> index for a regular table. > > I think this belongs completely in the remote data source. If you want > to index flat files, create an extra file for it or something, and > enhance the wrapper so that it can take advantage of it. Keeping the > index inside the database while the data is somewhere else creates a > whole new class of problems. For starters, how would you keep the index > up-to-date when the flat file is modified? Index update is outside the scope at least until foreign table update is supported. It is required for the user to do DROP INDEX and then do CREATE INDEX again when the file has been modified. I plan to implement the GetForeignPaths callback routine of file_fdw to just give up creating index paths if the file's checksum or timestamp or something has changed. I think the index of the file is something similar to the stats of it in a certain sense. >> On the other hand, for a remote table, >> CREATE INDEX collects information about the index on the specified >> column(s) for the specified table that was created on the remote table. > > I don't see the point of this either. The planner asks the FDW for cost > estimates, and if the FDW knows about indexes in the remote server, it > can certainly adjust the estimates accordingly. But that's all internal > to the FDW. It might want delegate the whole cost estimation to the > remote server by running EXPLAIN there, or it could use its knowledge of > indexes that exist there, but I don't see why the rest of the system > would need to know what indexes there are in the remote system. In the case of joining A on the local system and B on the remote system, for example, it is required for the local system to know what indexes there are on B in the remote system in order to consider index paths parametrized by A for nestloop-with-inner-parametrized-scan paths. > If the > FDW needs that information, it can query the remote server for it on > first access, and cache the information for the lifetime of the session. I think that is one of the choices. However, it seems convenient to me to utilize the existing framework for index information because this approach has the possibility for the FDWs to share e.g., the path creation processing with Postgres core to some extent to create index paths, of course which requires to refactor existing query optimization code. Best regards, Etsuro Fujita -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: Create index on foreign table
On 16.03.2012 10:44, Etsuro Fujita wrote: > I have a plan to support 'Create index on foreign table' for 9.3. Here > is my plan. > > The index creation is supported for a flat file such as CSV and a remote > table on a RDB e.g., Postgres using CREATE INDEX. (I thought using a > new statement, CREATE FOREIGN INDEX, at first, but I think that CREATE > INDEX would be sufficient to define an index for the foreign table.) > For a flat file, CREATE INDEX constructs an index in the same way as an > index for a regular table. I think this belongs completely in the remote data source. If you want to index flat files, create an extra file for it or something, and enhance the wrapper so that it can take advantage of it. Keeping the index inside the database while the data is somewhere else creates a whole new class of problems. For starters, how would you keep the index up-to-date when the flat file is modified? If you want to take advantage of PostgreSQL's indexing, you'll just have to just load the data into a regular table. > On the other hand, for a remote table, > CREATE INDEX collects information about the index on the specified > column(s) for the specified table that was created on the remote table. I don't see the point of this either. The planner asks the FDW for cost estimates, and if the FDW knows about indexes in the remote server, it can certainly adjust the estimates accordingly. But that's all internal to the FDW. It might want delegate the whole cost estimation to the remote server by running EXPLAIN there, or it could use its knowledge of indexes that exist there, but I don't see why the rest of the system would need to know what indexes there are in the remote system. If the FDW needs that information, it can query the remote server for it on first access, and cache the information for the lifetime of the session. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Proposal: Create index on foreign table
I have a plan to support 'Create index on foreign table' for 9.3. Here is my plan. The index creation is supported for a flat file such as CSV and a remote table on a RDB e.g., Postgres using CREATE INDEX. (I thought using a new statement, CREATE FOREIGN INDEX, at first, but I think that CREATE INDEX would be sufficient to define an index for the foreign table.) For a flat file, CREATE INDEX constructs an index in the same way as an index for a regular table. On the other hand, for a remote table, CREATE INDEX collects information about the index on the specified column(s) for the specified table that was created on the remote table. An index created is stored in pg_class and pg_index like an index for a regular table. It depends on the wrappers implementation whether it supports the options such as UNIQUE or WHERE predicates, though I think that CONCURRENTLY is not supported in common for the foreign tables. For a flat file, I plan that the user can specify all the options excluding CONCURRENTLY and UNIQUE. On the other hand, for a remote table, I think that the user can specify only the names of the foreign table and its column(s), using which the wrapper collects information about all the related indexes created on the remote table. To do so, I'd like to propose new FDW callback routines: CreateIndex(): This is called maybe from DefineIndex(), and does the similar task to index_create(). For a flat file, this function makes the catalog entries for the index and actually build the index, while for a remote table, it just stores the catalog entries collected from the remote end. DropIndex(): This is called at DROP INDEX, and does the similar task to index_drop(). I'd like to build the index physical data file for a flat file using the index access method of regular tables (ie btree, hash, gin, gist, and spgist) based on the following transformation between the TID and the file offset to some data in the file: block_number = file_offset_to_some_data / BLCKSZ offset_number = file_offset_to_some_data % BLCKSZ I plan to make use of the above index for better query optimization. For a flat file, I'd like to realize index scans, index-only scans, bitmap (like) scans and parametrized scans on the file in the same way as those on a regular table utilizing the currently revised FDW infrastructure. For a remote table, I have to admit that I don't have any clear idea to make use of the index information stored in the system catalogs for better query optimization, but I believe that it's useful for the ORDER BY push down and/or nestloop-with-inner-parametrized-scan join optimization. Thoughts? Best regards, Etsuro Fujita -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers