Re: [HACKERS] WIP: Collecting statistics on CSV file data

2012-04-08 Thread Etsuro Fujita
Thanks!

Best regards,
Etsuro Fujita

-Original Message-
From: pgsql-hackers-ow...@postgresql.org
[mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Tom Lane
Sent: Saturday, April 07, 2012 4:20 AM
To: Shigeru HANADA
Cc: Etsuro Fujita; pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] WIP: Collecting statistics on CSV file data 

Shigeru HANADA shigeru.han...@gmail.com writes:
 Just after my post, Fujita-san posted another v7 patch[1], so I merged
 v7 patches into v8 patch.

I've committed a modified version of this, but right after pushing it I had
a better idea about what the AnalyzeForeignTable API should do.
An issue that I'd not previously figured out is how analysis of an
inheritance tree could deal with foreign-table members, because it wants to
estimate the members' sizes before collecting the actual sample rows.
However, given that we've got the work split into a precheck phase and a
sample collection phase, that's not hard to solve: we could insist that the
FDW give back a size estimate in the precheck phase not the sample
collection phase.  I'm off to fix that up ...

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make
changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Collecting statistics on CSV file data

2012-04-06 Thread Tom Lane
Shigeru HANADA shigeru.han...@gmail.com writes:
 Just after my post, Fujita-san posted another v7 patch[1], so I merged
 v7 patches into v8 patch.

I've committed a modified version of this, but right after pushing it
I had a better idea about what the AnalyzeForeignTable API should do.
An issue that I'd not previously figured out is how analysis of an
inheritance tree could deal with foreign-table members, because it wants
to estimate the members' sizes before collecting the actual sample rows.
However, given that we've got the work split into a precheck phase and
a sample collection phase, that's not hard to solve: we could insist
that the FDW give back a size estimate in the precheck phase not the
sample collection phase.  I'm off to fix that up ...

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Collecting statistics on CSV file data

2012-04-06 Thread Shigeru Hanada
On Sat, Apr 7, 2012 at 4:19 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Shigeru HANADA shigeru.han...@gmail.com writes:
 Just after my post, Fujita-san posted another v7 patch[1], so I merged
 v7 patches into v8 patch.

 I've committed a modified version of this, but right after pushing it
 I had a better idea about what the AnalyzeForeignTable API should do.
 An issue that I'd not previously figured out is how analysis of an
 inheritance tree could deal with foreign-table members, because it wants
 to estimate the members' sizes before collecting the actual sample rows.
 However, given that we've got the work split into a precheck phase and
 a sample collection phase, that's not hard to solve: we could insist
 that the FDW give back a size estimate in the precheck phase not the
 sample collection phase.  I'm off to fix that up ...

Thanks.  I'll improve pgsql_fdw so that it can collect statistics of
foreign data with this new API.

Regards,
-- 
Shigeru Hanada

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Collecting statistics on CSV file data

2012-04-05 Thread Etsuro Fujita
Thanks, Hanada-san!

Best regards,
Etsuro Fujita

-Original Message-
From: Shigeru HANADA [mailto:shigeru.han...@gmail.com] 
Sent: Friday, April 06, 2012 11:41 AM
To: Etsuro Fujita
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] WIP: Collecting statistics on CSV file data

(2012/04/05 21:10), Shigeru HANADA wrote:
 file_fdw
 
 This patch contains a use case of new handler function in 
 contrib/file_fdw.  Since file_fdw reads data from a flat file, 
 fileAnalyzeForeignTable uses similar algorithm to ordinary tables;  it 
 samples first N rows first, and replaces them randomly with subsequent 
 rows.  Also file_fdw updates pg_class.relpages by calculating number 
 of pages from size of the data file.
 
 To allow FDWs to implement sampling argorighm like this, several 
 functions are exported from analyze.c, e.g. random_fract, 
 init_selection_state, and get_next_S.

Just after my post, Fujita-san posted another v7 patch[1], so I merged
v7 patches into v8 patch.

[1] http://archives.postgresql.org/pgsql-hackers/2012-04/msg00212.php

Changes taken from Fujita-san's patch
=
* Remove reporting validrows and deadrows at the end of acquire_sample_rows
of file_fdw.  Thus, it doesn't validate NOT NULL constraints any more.
* Improve get_rel_size of file_fdw, which is used in GetForeignRelSize, to
estimate current # of tuples of the foreign table from these values.
  - # of pages/tuples which are updated by last ANALYZE
  - current file size

Additional Changes
==
* Fix memory leak in acquire_sample_rows which caused by calling
NextCopyFrom repeatedly in one long-span memory context.  I add per-record
temporary context and it's used during processing a record.
Main context is used to create heap tuples from sampled records, because
sample tuples must be valid after the function ends.
* Some cosmetic changes for document, e.g. remove line-break inside tagged
elements.
* Some cosmetic changes to make patch more readable by minimizing difference
from master branch.

Changes did *not* merged

* Fujita-san moved document of AnalyzeForeignTable to the section Foreign
Data Wrapper Helper Functions from Foreign Data Wrapper Callback
Routines.  But I think analyze handler is one of callback functions, though
it's optional.

Please find attached a patch.

Regards,
--
Shigeru HANADA


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Collecting statistics on CSV file data

2012-02-16 Thread Etsuro Fujita
Hi Hanada-san,

Sorry for the late response.

(2012/02/10 22:05), Shigeru Hanada wrote:
 (2011/12/15 11:30), Etsuro Fujita wrote:
 (2011/12/14 15:34), Shigeru Hanada wrote:
 I think this patch could be marked as Ready for committer with some
 minor fixes.  Please find attached a revised patch (v6.1).
 
 I've tried to make pgsql_fdw work with this feature, and found that few
 static functions to be needed to exported to implement ANALYZE handler
 in short-cut style.  The Short-cut style means the way to generate
 statistics (pg_class and pg_statistic) for foreign tables without
 retrieving sample data from foreign server.

That's great!  Here is my review.

The patch applies with some modifications and compiles cleanly.  But
regression tests on subqueries failed in addition to role related tests
as discussed earlier.

While I've not looked at the patch in detail, I have some comments:

1. The patch might need codes to handle the irregular case where
ANALYZE-related catalog data such as attstattarget are different between
the local and the remote. (Although we don't have the options to set
such a data on a foreign table in ALTER FOREIGN TABLE.)  For example,
while attstattarget = -1 for some column on the local, attstattarget = 0
for that column on the remote meaning that there can be no stats
available for that column.  In such a case it would be better to inform
the user of it.

2. It might be better for the FDW to estimate the costs of a remote
query for itself without doing EXPLAIN if stats were available using
this feature.  While this approach is less accurate compared to the
EXPLAIN approach due to the lack of information such as seq_page_cost or
randam_page_cost on the remote, it is cheaper!  I think such a
information may be added to generic options for a foreign table, which
may have been previously discussed.

3.
 In implementing ANALYZE handler, hardest part was copying anyarray
 values from remote to local.  If we can make it common in core, it would
 help FDW authors who want to implement ANALYZE handler without
 retrieving sample rows from remote server.

+1 from me.

Best regards,
Etsuro Fujita

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Collecting statistics on CSV file data

2012-02-10 Thread Shigeru Hanada
(2011/12/15 11:30), Etsuro Fujita wrote:
 (2011/12/14 15:34), Shigeru Hanada wrote:
 I think this patch could be marked as Ready for committer with some
 minor fixes.  Please find attached a revised patch (v6.1).

I've tried to make pgsql_fdw work with this feature, and found that few
static functions to be needed to exported to implement ANALYZE handler
in short-cut style.  The Short-cut style means the way to generate
statistics (pg_class and pg_statistic) for foreign tables without
retrieving sample data from foreign server.

Attached patch (export_funcs.patch) exports examine_attribute and
update_attstats which are necessary to implement ANALYZE handler for
pgsql_fdw.  In addition to exporting, update_attstats is also renamed to
vac_update_attstats to fit with already exported function
vac_update_relstats.

I also attached archive of WIP pgsql_fdw with ANALYZE support.  This
version has better estimation than original pgsql_fdw, because it can
use selectivity of qualifiers evaluated on local side to estimate number
of result rows.  To show the effect of ANALYZE clearly, WHERE push-down
feature is disabled.  Please see pgsqlAnalyzeForeignTable and
store_remote_stats in pgsql_fdw.c.

I used pgbench_accounts tables with 3 records, and got reasonable
rows estimation for queries below.

on remote side
postgres=# UPDATE pgbench_accounts SET filler = NULL
postgres-# WHERE aid % 3 = 0;
postgres=# ANALYZE;

on local side
postgres=# ANALYZE pgbench_accounts;  -- needs explicit table name
postgres=# EXPLAIN SELECT * FROM pgbench_accounts WHERE filler IS NULL;
 QUERY PLAN

 Foreign Scan on pgbench_accounts  (cost=100.00..40610.00 rows=100030
width=97)
   Filter: (filler IS NULL)
   Remote SQL: DECLARE pgsql_fdw_cursor_13 SCROLL CURSOR FOR SELECT aid,
bid, abalance, filler FROM public.pgbench_accounts
(3 rows)

postgres=# EXPLAIN SELECT * FROM pgbench_accounts WHERE aid  100;
 QUERY PLAN

 Foreign Scan on pgbench_accounts  (cost=100.00..40610.00 rows=96 width=97)
   Filter: (aid  100)
   Remote SQL: DECLARE pgsql_fdw_cursor_14 SCROLL CURSOR FOR SELECT aid,
bid, abalance, filler FROM public.pgbench_accounts
(3 rows)

postgres=# EXPLAIN SELECT * FROM pgbench_accounts WHERE aid  1000;
 QUERY PLAN

 Foreign Scan on pgbench_accounts  (cost=100.00..40610.00 rows=1004
width=97)
   Filter: (aid  1000)
   Remote SQL: DECLARE pgsql_fdw_cursor_15 SCROLL CURSOR FOR SELECT aid,
bid, abalance, filler FROM public.pgbench_accounts
(3 rows)

In implementing ANALYZE handler, hardest part was copying anyarray
values from remote to local.  If we can make it common in core, it would
help FDW authors who want to implement ANALYZE handler without
retrieving sample rows from remote server.

Regards,
-- 
Shigeru Hanada
commit bb28cb5a69aae3bd9c7fbebc8b9483d23711bec4
Author: Shigeru Hanada shigeru.han...@gmail.com
Date:   Thu Feb 9 16:06:14 2012 +0900

Export functions which are useful for FDW analyze support.

Export examine_attribute and update_attstats (with renaming to
vac_update_attstats) which are useful (and nealy required) to implement
short-cut version of ANALYZE handler in FDWs.

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 6a22d49..d0a323a 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -94,8 +94,6 @@ static void compute_index_stats(Relation onerel, double 
totalrows,
AnlIndexData *indexdata, int nindexes,
HeapTuple *rows, int numrows,
MemoryContext col_context);
-static VacAttrStats *examine_attribute(Relation onerel, int attnum,
- Node *index_expr);
 static int acquire_sample_rows(Relation onerel,
   HeapTuple *rows, int 
targrows,
   double *totalrows, 
double *totaldeadrows,
@@ -105,8 +103,6 @@ static int acquire_inherited_sample_rows(Relation onerel,
  double *totalrows, 
double *totaldeadrows,
  BlockNumber 
*totalpages, int elevel);
 static int compare_rows(const void *a, const void *b);
-static void update_attstats(Oid relid, bool inh,
-   int natts, VacAttrStats **vacattrstats);
 static Datum 

Re: [HACKERS] WIP: Collecting statistics on CSV file data

2011-12-14 Thread Etsuro Fujita
(2011/12/14 15:34), Shigeru Hanada wrote:
 (2011/12/13 22:00), Etsuro Fujita wrote:
 Thank you for your effectiveness experiments and proposals for
 improvements.  I updated the patch according to your proposals.
 Attached is the updated version of the patch.
 
 I think this patch could be marked as Ready for committer with some
 minor fixes.  Please find attached a revised patch (v6.1).

Many thanks.

Best regards,
Etsuro Fujita

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Collecting statistics on CSV file data

2011-12-12 Thread Shigeru Hanada
(2011/12/09 21:16), Etsuro Fujita wrote:
 I updated the patch.  Please find attached a patch.

I've examined v5 patch, and got reasonable EXPLAIN results which reflect
collected statistics!  As increasing STATISTICS option, estimated rows
become better.  Please see attached stats_*.txt for what I
tested.

  stats_none.txt  : before ANALYZE
  stats_100.txt   : SET STATISTICS = 100 for all columns, and ANALYZE
  stats_1.txt : SET STATISTICS = 1 for all columns, and ANALYZE

I think that this patch become ready for committer after some
minor corrections:

Couldn't set n_distinct
===
I couldn't set n_distinct to columns of foreign tables.  With some
research, I noticed that ATSimplePermissions should accept
ATT_FOREIGN_TABLE too for that case.  In addition, regression tests for
ALTER FOREIGN TABLE should be added to detect this kind of problem.

Showing stats target

We can see stats target of ordinary tables with \d+, but it is not
available for foreign tables.  I think Stats target column should be
added even though output of \d+ for foreign tables become wider.  One
possible idea is to remove useless Storage column instead, but views
have that column even though their source could come from foreign tables.

Please see attached patch for these two items.

Comments of FdwRoutine
==
Mention about optional handler is obsolete.  We should clearly say
AnalyzeForeignTable is optional (can be set to NULL) and rest are
required.  IMO separating them with comment would help FDW authors to
understand requirements, e.g.:

typedef struct FdwRoutine
{
NodeTag type;

/*
 * These Handlers are required to execute simple scan on a foreign
 * table.  If any of them was set to NULL, scan on a foreign table
 * managed by such FDW would fail.
 */
PlanForeignScan_function PlanForeignScan;
ExplainForeignScan_function ExplainForeignScan;
BeginForeignScan_function BeginForeignScan;
IterateForeignScan_function IterateForeignScan;
ReScanForeignScan_function ReScanForeignScan;
EndForeignScan_function EndForeignScan;

/*
 * Handlers below are optional.  You can set any of them to
 * NULL to tell PostgreSQL that the FDW doesn't have the
 * capability.
 */
AnalyzeForeignTable_function AnalyzeForeignTable;
} FdwRoutine;

Code formatting
===
Some code lines go past 80 columns.

Message style
=
The terms 'cannot support option n_distinct...' used in
ATPrepSetOptions seems little unusual in PostgreSQL.  Should we say
'cannot set n_distinct_inherited for foreign tables' for that case?

Typo

Typo spcify is in document of analyze.

Regards,
-- 
Shigeru Hanada

postgres=# explain select * from csv_accounts where aid = 1;
   QUERY PLAN
-
 Foreign Scan on csv_accounts  (cost=0.00..44262.88 rows=1874 width=100)
   Filter: (aid = 1)
   Foreign File: /home/hanada/DB-fujita/CSV/pgbench_accounts.csv
   Foreign File Size: 47963070
(4 rows)

postgres=# explain select * from csv_accounts where aid  1000;
QUERY PLAN
---
 Foreign Scan on csv_accounts  (cost=0.00..44262.88 rows=124904 width=100)
   Filter: (aid  1000)
   Foreign File: /home/hanada/DB-fujita/CSV/pgbench_accounts.csv
   Foreign File Size: 47963070
(4 rows)

postgres=# explain select * from csv_accounts where aid  1000;
QUERY PLAN
---
 Foreign Scan on csv_accounts  (cost=0.00..44262.88 rows=124904 width=100)
   Filter: (aid  1000)
   Foreign File: /home/hanada/DB-fujita/CSV/pgbench_accounts.csv
   Foreign File Size: 47963070
(4 rows)

postgres=# explain select * from csv_accounts where bid = 1;
   QUERY PLAN
-
 Foreign Scan on csv_accounts  (cost=0.00..44262.88 rows=1874 width=100)
   Filter: (bid = 1)
   Foreign File: /home/hanada/DB-fujita/CSV/pgbench_accounts.csv
   Foreign File Size: 47963070
(4 rows)

postgres=# explain select * from csv_accounts where bid = 2;
   QUERY PLAN
-
 Foreign Scan on csv_accounts  (cost=0.00..44262.88 rows=1874 width=100)
   Filter: (bid = 2)
   Foreign File: /home/hanada/DB-fujita/CSV/pgbench_accounts.csv
   Foreign File Size: 47963070
(4 rows)

postgres=# explain select * from csv_accounts where bid = 4;
QUERY PLAN
--
 Foreign Scan on csv_accounts  (cost=0.00..57105.00 rows=102117 width=97)
   Filter: (bid = 4)
   Foreign File: /home/hanada/DB-fujita/CSV/pgbench_accounts.csv
   Foreign File 

Re: [HACKERS] WIP: Collecting statistics on CSV file data

2011-12-09 Thread Etsuro Fujita
Hi Hanada-san,

I updated the patch.  Please find attached a patch.

Best regards,
Etsuro Fujita

 (2011/11/18 21:00), Shigeru Hanada wrote:
 (2011/11/18 16:25), Etsuro Fujita wrote:
 Thank you for your testing.  I updated the patch according to your
 comments.  Attached is the updated version of the patch.

 I'd like to share result of my review even though it's not fully
 finished.  So far I looked from viewpoint of API design, code
 formatting, and documentation.  I'll examine effectiveness of the patch
 and details of implementation next week, and hopefully try writing
 ANALYZE handler for pgsql_fdw :)

 New patch has correct format, and it could be applied to HEAD of master
 branch cleanly.  All regression tests including those of contrib modules
 have passed.  It contains changes of codes and regression tests related
 to the issue, and they have enough comments.  IMO the document in this
 patch is not enough to show how to write analyze handler to FDW authors,
 but it can be enhanced afterward.  With this patch, FDW author can
 provide optional ANALYZE handler which updates statistics of foreign
 tables.  Planner would be able to generate better plan by using statistics.

 Yes.  But in the updated version, I've refactored analyze.c a little bit
 to allow FDW authors to simply call do_analyze_rel().
 snip
 The updated version enables FDW authors to just write their own
 acquire_sample_rows().  On the other hand, by retaining to hook
 AnalyzeForeignTable routine at analyze_rel(), higher level than
 acquire_sample_rows() as before, it allows FDW authors to write
 AnalyzeForeignTable routine for foreign tables on a remote server to ask
 the server for its current stats instead, as pointed out earlier by Tom
 Lane.

 IIUC, this patch offers three options to FDWs: a) set
 AnalyzeForeignTable to NULL to indicate lack of capability, b) provide
 AnalyzeForeignTable which calls do_analyze_rel with custom
 sample_row_acquirer, and c) create statistics data from scratch by FDW
 itself by doing similar things to do_analyze_rel with given argument or
 copying statistics from remote PostgreSQL server.

 ISTM that this design is well-balanced between simplicity and
 flexibility.  Maybe these three options would suit web-based wrappers,
 file-based or RDBMS wrappers, and pgsql_fdw respectively.  I think that
 adding more details of FdwRoutine, such as purpose of new callback
 function and difference from required ones, would help FDW authors,
 including me :)

 I have some random comments:

 - I think separated typedef of sample_acquire_rows would make codes more
 readable.  In addition, parameters of the function should be described
 explicitly.
 - I couldn't see the reason why file_fdw sets ctid of sample tuples,
 though I guess it's for Vitter's random sampling algorithm.  If every
 FDW must set valid ctid to sample tuples, it should be mentioned in
 document of AnalyzeForeignTable.  Exporting some functions from
 analyze.c relates this issue?
 - Why file_fdw skips sample tuples which have NULL value?  AFAIS
 original acquire_sample_rows doesn't do so.
 - Some comment lines go past 80 columns.
 - Some headers included in file_fdw.c seems unnecessary.

 Regards,
 
 

*** a/contrib/file_fdw/file_fdw.c
--- b/contrib/file_fdw/file_fdw.c
***
*** 20,25 
--- 20,26 
  #include commands/copy.h
  #include commands/defrem.h
  #include commands/explain.h
+ #include commands/vacuum.h
  #include foreign/fdwapi.h
  #include foreign/foreign.h
  #include miscadmin.h
***
*** 101,106  static void fileBeginForeignScan(ForeignScanState *node, int 
eflags);
--- 102,108 
  static TupleTableSlot *fileIterateForeignScan(ForeignScanState *node);
  static void fileReScanForeignScan(ForeignScanState *node);
  static void fileEndForeignScan(ForeignScanState *node);
+ static void fileAnalyzeForeignTable(Relation onerel, VacuumStmt *vacstmt, int 
elevel);
  
  /*
   * Helper functions
***
*** 112,118  static List *get_file_fdw_attribute_options(Oid relid);
  static void estimate_costs(PlannerInfo *root, RelOptInfo *baserel,
   const char *filename,
   Cost *startup_cost, Cost *total_cost);
! 
  
  /*
   * Foreign-data wrapper handler function: return a struct with pointers
--- 114,123 
  static void estimate_costs(PlannerInfo *root, RelOptInfo *baserel,
   const char *filename,
   Cost *startup_cost, Cost *total_cost);
! static intacquire_sample_rows(Relation onerel,
!  HeapTuple *rows, int targrows,
!  double *totalrows, double *totaldeadrows,
!  BlockNumber *totalpages, int elevel);
  
  /*
   * Foreign-data wrapper handler function: return a struct with pointers
***
*** 129,134  file_fdw_handler(PG_FUNCTION_ARGS)
--- 134,140 
fdwroutine-IterateForeignScan = fileIterateForeignScan;
 

Re: [HACKERS] WIP: Collecting statistics on CSV file data

2011-11-20 Thread Etsuro Fujita
Hi Hanada-san,

Thank you for your valuable comments.  I will improve the items pointed
out by you at the next version of the patch, including documentation on
the purpose of AnalyzeForeignTable, how to write it, and so on.  Here I
comment only one point:

 - Why file_fdw skips sample tuples which have NULL value?  AFAIS
 original acquire_sample_rows doesn't do so.

To be precise, I've implemented to skip tuples that have null values in
certain column(s) but that are not allowed to contain null values in
that(those) column(s) by NOT NULL constrain.  file_fdw's
sample_row_acquirer considers those tuples as dead tuples.  This is
for the consistency with NOT NULL constrain.  (But I don't know why
fileIterateForeignScan routine allows such dead tuples.  I may have
missed something.)

Best regards,
Etsuro Fujita

(2011/11/18 21:00), Shigeru Hanada wrote:
 (2011/11/18 16:25), Etsuro Fujita wrote:
 Thank you for your testing.  I updated the patch according to your
 comments.  Attached is the updated version of the patch.
 
 I'd like to share result of my review even though it's not fully
 finished.  So far I looked from viewpoint of API design, code
 formatting, and documentation.  I'll examine effectiveness of the patch
 and details of implementation next week, and hopefully try writing
 ANALYZE handler for pgsql_fdw :)
 
 New patch has correct format, and it could be applied to HEAD of master
 branch cleanly.  All regression tests including those of contrib modules
 have passed.  It contains changes of codes and regression tests related
 to the issue, and they have enough comments.  IMO the document in this
 patch is not enough to show how to write analyze handler to FDW authors,
 but it can be enhanced afterward.  With this patch, FDW author can
 provide optional ANALYZE handler which updates statistics of foreign
 tables.  Planner would be able to generate better plan by using statistics.
 
 Yes.  But in the updated version, I've refactored analyze.c a little bit
 to allow FDW authors to simply call do_analyze_rel().
 snip
 The updated version enables FDW authors to just write their own
 acquire_sample_rows().  On the other hand, by retaining to hook
 AnalyzeForeignTable routine at analyze_rel(), higher level than
 acquire_sample_rows() as before, it allows FDW authors to write
 AnalyzeForeignTable routine for foreign tables on a remote server to ask
 the server for its current stats instead, as pointed out earlier by Tom
 Lane.
 
 IIUC, this patch offers three options to FDWs: a) set
 AnalyzeForeignTable to NULL to indicate lack of capability, b) provide
 AnalyzeForeignTable which calls do_analyze_rel with custom
 sample_row_acquirer, and c) create statistics data from scratch by FDW
 itself by doing similar things to do_analyze_rel with given argument or
 copying statistics from remote PostgreSQL server.
 
 ISTM that this design is well-balanced between simplicity and
 flexibility.  Maybe these three options would suit web-based wrappers,
 file-based or RDBMS wrappers, and pgsql_fdw respectively.  I think that
 adding more details of FdwRoutine, such as purpose of new callback
 function and difference from required ones, would help FDW authors,
 including me :)
 
 I have some random comments:
 
 - I think separated typedef of sample_acquire_rows would make codes more
 readable.  In addition, parameters of the function should be described
 explicitly.
 - I couldn't see the reason why file_fdw sets ctid of sample tuples,
 though I guess it's for Vitter's random sampling algorithm.  If every
 FDW must set valid ctid to sample tuples, it should be mentioned in
 document of AnalyzeForeignTable.  Exporting some functions from
 analyze.c relates this issue?
 - Why file_fdw skips sample tuples which have NULL value?  AFAIS
 original acquire_sample_rows doesn't do so.
 - Some comment lines go past 80 columns.
 - Some headers included in file_fdw.c seems unnecessary.
 
 Regards,


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Collecting statistics on CSV file data

2011-11-20 Thread Etsuro Fujita

(2011/11/19 0:54), Robert Haas wrote:

2011/11/18 Shigeru Hanadashigeru.han...@gmail.com:

- I couldn't see the reason why file_fdw sets ctid of sample tuples,
though I guess it's for Vitter's random sampling algorithm.  If every
FDW must set valid ctid to sample tuples, it should be mentioned in
document of AnalyzeForeignTable.  Exporting some functions from
analyze.c relates this issue?


If every FDW must set valid ctid to sample tuples, it should be fixed
so that they don't have to, I would think.


It's for neither Vitter's algorithm nor exporting functions from 
analyze.c.  It's for foreign index scan on CSV file data that I plan 
to propose in the next CF.  So, it is meaningless for now.  I'm sorry. 
I will fix it at the next version of the patch so that they don't have to.


Best regards,
Etsuro Fujita

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Collecting statistics on CSV file data

2011-11-18 Thread Shigeru Hanada
(2011/11/18 16:25), Etsuro Fujita wrote:
 Thank you for your testing.  I updated the patch according to your
 comments.  Attached is the updated version of the patch.

I'd like to share result of my review even though it's not fully
finished.  So far I looked from viewpoint of API design, code
formatting, and documentation.  I'll examine effectiveness of the patch
and details of implementation next week, and hopefully try writing
ANALYZE handler for pgsql_fdw :)

New patch has correct format, and it could be applied to HEAD of master
branch cleanly.  All regression tests including those of contrib modules
have passed.  It contains changes of codes and regression tests related
to the issue, and they have enough comments.  IMO the document in this
patch is not enough to show how to write analyze handler to FDW authors,
but it can be enhanced afterward.  With this patch, FDW author can
provide optional ANALYZE handler which updates statistics of foreign
tables.  Planner would be able to generate better plan by using statistics.

 Yes.  But in the updated version, I've refactored analyze.c a little bit
 to allow FDW authors to simply call do_analyze_rel().
snip
 The updated version enables FDW authors to just write their own
 acquire_sample_rows().  On the other hand, by retaining to hook
 AnalyzeForeignTable routine at analyze_rel(), higher level than
 acquire_sample_rows() as before, it allows FDW authors to write
 AnalyzeForeignTable routine for foreign tables on a remote server to ask
 the server for its current stats instead, as pointed out earlier by Tom
 Lane.

IIUC, this patch offers three options to FDWs: a) set
AnalyzeForeignTable to NULL to indicate lack of capability, b) provide
AnalyzeForeignTable which calls do_analyze_rel with custom
sample_row_acquirer, and c) create statistics data from scratch by FDW
itself by doing similar things to do_analyze_rel with given argument or
copying statistics from remote PostgreSQL server.

ISTM that this design is well-balanced between simplicity and
flexibility.  Maybe these three options would suit web-based wrappers,
file-based or RDBMS wrappers, and pgsql_fdw respectively.  I think that
adding more details of FdwRoutine, such as purpose of new callback
function and difference from required ones, would help FDW authors,
including me :)

I have some random comments:

- I think separated typedef of sample_acquire_rows would make codes more
readable.  In addition, parameters of the function should be described
explicitly.
- I couldn't see the reason why file_fdw sets ctid of sample tuples,
though I guess it's for Vitter's random sampling algorithm.  If every
FDW must set valid ctid to sample tuples, it should be mentioned in
document of AnalyzeForeignTable.  Exporting some functions from
analyze.c relates this issue?
- Why file_fdw skips sample tuples which have NULL value?  AFAIS
original acquire_sample_rows doesn't do so.
- Some comment lines go past 80 columns.
- Some headers included in file_fdw.c seems unnecessary.

Regards,
-- 
Shigeru Hanada


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Collecting statistics on CSV file data

2011-11-18 Thread Robert Haas
2011/11/18 Shigeru Hanada shigeru.han...@gmail.com:
 - I couldn't see the reason why file_fdw sets ctid of sample tuples,
 though I guess it's for Vitter's random sampling algorithm.  If every
 FDW must set valid ctid to sample tuples, it should be mentioned in
 document of AnalyzeForeignTable.  Exporting some functions from
 analyze.c relates this issue?

If every FDW must set valid ctid to sample tuples, it should be fixed
so that they don't have to, I would think.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Collecting statistics on CSV file data

2011-11-17 Thread Etsuro Fujita
(2011/11/07 20:26), Shigeru Hanada wrote:
 (2011/10/20 18:56), Etsuro Fujita wrote:
 I revised the patch according to Hanada-san's comments. Attached is the
 updated version of the patch.

 Changes:

 * pull up of logging analyzing foo.bar
 * new vac_update_relstats always called
 * tab-completion in psql
 * add foreign tables are not analyzed automatically... to 23.1.3
 Updating Planner Statistics
 * some other modifications
 
 Submission review
 =
 
 - Patch can be applied, and all regression tests passed. :)

Thank you for your testing.  I updated the patch according to your
comments.  Attached is the updated version of the patch.

 - file_fdw_do_analyze_rel is almost copy of do_analyze_rel.  IIUC,
 difference against do_analyze_rel are:
  * don't logging analyze target
  * don't switch userid to the owner of target table
  * don't measure elapsed time for autoanalyze deamon
  * don't handle index
  * some comments are removed.
  * sample rows are acquired by file_fdw's routine
 
 I don't see any problem here, but would you confirm that all of them are
 intentional?

Yes.  But in the updated version, I've refactored analyze.c a little bit
to allow FDW authors to simply call do_analyze_rel().

 - In your design, each FDW have to copy most of do_analyze_rel to their
 own source.  It means that FDW authors must know much details of ANALYZE
 to implement ANALYZE handler.  Actually, your patch exports some static
 functions from analyze.c.  Have you considered hooking
 acquire_sample_rows()?  Such handler should be more simple, and
 FDW-specific.  As you say, such design requires FDWs to skip some
 records, but it would be difficult for some FDW (e.g. twitter_fdw) which
 can't pick sample data up easily.  IMHO such problem *must* be solved by
 FDW itself.

The updated version enables FDW authors to just write their own
acquire_sample_rows().  On the other hand, by retaining to hook
AnalyzeForeignTable routine at analyze_rel(), higher level than
acquire_sample_rows() as before, it allows FDW authors to write
AnalyzeForeignTable routine for foreign tables on a remote server to ask
the server for its current stats instead, as pointed out earlier by Tom
Lane.

Best regards,
Etsuro Fujita
*** a/contrib/file_fdw/file_fdw.c
--- b/contrib/file_fdw/file_fdw.c
***
*** 16,31 
  #include unistd.h
  
  #include access/reloptions.h
  #include catalog/pg_foreign_table.h
  #include commands/copy.h
  #include commands/defrem.h
  #include commands/explain.h
  #include foreign/fdwapi.h
  #include foreign/foreign.h
  #include miscadmin.h
  #include nodes/makefuncs.h
  #include optimizer/cost.h
! #include utils/rel.h
  #include utils/syscache.h
  
  PG_MODULE_MAGIC;
--- 16,36 
  #include unistd.h
  
  #include access/reloptions.h
+ #include access/transam.h
  #include catalog/pg_foreign_table.h
  #include commands/copy.h
  #include commands/defrem.h
  #include commands/explain.h
+ #include commands/vacuum.h
  #include foreign/fdwapi.h
  #include foreign/foreign.h
  #include miscadmin.h
  #include nodes/makefuncs.h
  #include optimizer/cost.h
! #include parser/parse_relation.h
! #include pgstat.h
! #include utils/attoptcache.h
! #include utils/memutils.h
  #include utils/syscache.h
  
  PG_MODULE_MAGIC;
***
*** 101,106  static void fileBeginForeignScan(ForeignScanState *node, int 
eflags);
--- 106,112 
  static TupleTableSlot *fileIterateForeignScan(ForeignScanState *node);
  static void fileReScanForeignScan(ForeignScanState *node);
  static void fileEndForeignScan(ForeignScanState *node);
+ static void fileAnalyzeForeignTable(Relation onerel, VacuumStmt *vacstmt, int 
elevel);
  
  /*
   * Helper functions
***
*** 112,118  static List *get_file_fdw_attribute_options(Oid relid);
  static void estimate_costs(PlannerInfo *root, RelOptInfo *baserel,
   const char *filename,
   Cost *startup_cost, Cost *total_cost);
! 
  
  /*
   * Foreign-data wrapper handler function: return a struct with pointers
--- 118,127 
  static void estimate_costs(PlannerInfo *root, RelOptInfo *baserel,
   const char *filename,
   Cost *startup_cost, Cost *total_cost);
! static intacquire_sample_rows(Relation onerel,
!  HeapTuple *rows, int targrows,
!  double *totalrows, double *totaldeadrows,
!  BlockNumber *totalpages, int elevel);
  
  /*
   * Foreign-data wrapper handler function: return a struct with pointers
***
*** 129,134  file_fdw_handler(PG_FUNCTION_ARGS)
--- 138,144 
fdwroutine-IterateForeignScan = fileIterateForeignScan;
fdwroutine-ReScanForeignScan = fileReScanForeignScan;
fdwroutine-EndForeignScan = fileEndForeignScan;
+   fdwroutine-AnalyzeForeignTable = fileAnalyzeForeignTable;
  

Re: [HACKERS] WIP: Collecting statistics on CSV file data

2011-11-07 Thread Shigeru Hanada
(2011/10/20 18:56), Etsuro Fujita wrote:
 I revised the patch according to Hanada-san's comments. Attached is the
 updated version of the patch.
 
 Changes:
 
* pull up of logging analyzing foo.bar
* new vac_update_relstats always called
* tab-completion in psql
* add foreign tables are not analyzed automatically... to 23.1.3
 Updating Planner Statistics
* some other modifications

Submission review
=

- Patch can be applied, and all regression tests passed. :)

Random comments
===

- Some headers are not necessary for file_fdw.c

#include access/htup.h
#include commands/dbcommands.h
#include optimizer/plancat.h
#include utils/elog.h
#include utils/guc.h
#include utils/lsyscache.h
#include utils/rel.h

- It might be better to mention in document that users need to
explicitly specify foreign table name to ANALYZE command.

- I think backend should be aware the case which a handler is NULL.  For
the case of ANALYZE of foreign table, it would be worth telling user
that request was not accomplished.

- file_fdw_do_analyze_rel is almost copy of do_analyze_rel.  IIUC,
difference against do_analyze_rel are:
* don't logging analyze target
* don't switch userid to the owner of target table
* don't measure elapsed time for autoanalyze deamon
* don't handle index
* some comments are removed.
* sample rows are acquired by file_fdw's routine

I don't see any problem here, but would you confirm that all of them are
intentional?

Besides, keeping format (mainly indent and folding) of this function
similar to do_analyze_rel would help to follow changes in do_analyze_rel.

- IMHO exporting CopyState should be avoided.  One possible idea is
adding new COPY API which allows to extract records from the file with
skipping specified number or rate.

- In your design, each FDW have to copy most of do_analyze_rel to their
own source.  It means that FDW authors must know much details of ANALYZE
to implement ANALYZE handler.  Actually, your patch exports some static
functions from analyze.c.  Have you considered hooking
acquire_sample_rows()?  Such handler should be more simple, and
FDW-specific.  As you say, such design requires FDWs to skip some
records, but it would be difficult for some FDW (e.g. twitter_fdw) which
can't pick sample data up easily.  IMHO such problem *must* be solved by
FDW itself.

-- 
Shigeru Hanada

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Collecting statistics on CSV file data

2011-10-19 Thread Etsuro Fujita

Hi,

(2011/10/18 16:32), Leonardo Francalanci wrote:

New API AnalyzeForeignTable


I didn't look at the patch, but I'm using CSV foreign tables with named pipes
to get near-realtime KPI calculated by postgresql. Of course, pipes can be
read just once, so I wouldn't want an automatic analyze of foreign tables...


The patch does not analyze on foreign tables automatically. (The issue 
of auto-analyze on foreign tables has been discussed. Please refer to [1].)


[1] http://archives.postgresql.org/pgsql-hackers/2011-09/msg00992.php

Best regards,
Etsuro Fujita

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Collecting statistics on CSV file data

2011-10-18 Thread Leonardo Francalanci
 New API AnalyzeForeignTable



I didn't look at the patch, but I'm using CSV foreign tables with named pipes
to get near-realtime KPI calculated by postgresql. Of course, pipes can be
read just once, so I wouldn't want an automatic analyze of foreign tables...

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Collecting statistics on CSV file data

2011-10-17 Thread Shigeru Hanada
(2011/10/07 18:09), Etsuro Fujita wrote:
 Thank you for the review and the helpful information.
 I rebased. Please find attached a patch. I'll add the patch to the next CF.
 
 Changes:
 
* cleanups and fixes
* addition of the following to ALTER FOREIGN TABLE
ALTER [COLUMN] column SET STATISTICS integer
ALTER [COLUMN] column SET ( n_distinct = val ) (n_distinct only)
ALTER [COLUMN] column RESET ( n_distinct )
* reflection of the force_not_null info in acquiring sample rows
* documentation

The new patch could be applied with some shifts.  Regression tests of
core and file_fdw have passed cleanly.  Though I've tested only simple
tests, ANALYZE works for foreign tables for file_fdw, and estimation of
costs and selectivity seem appropriate.

New API AnalyzeForeignTable
===
In your design, new handler function is called instead of
do_analylze_rel.  IMO this hook point would be good for FDWs which can
provide statistics in optimized way.  For instance, pgsql_fdw can simply
copy statistics from remote PostgreSQL server if they are compatible.
Possible another idea is to replace acquire_sample_rows so that other
FDWs can reuse most part of fileAnalyzeForeignTable and
file_fdw_do_analyze_rel.

And I think that AnalyzeForeignTable should be optional, and it would be
very useful if a default handler is provided.  Probably a default
handler can use basic FDW APIs to acquire sample rows from the result of
SELECT * FROM foreign_table with skipping periodically.  It won't be
efficient but I think it's not so unreasonable.

Other issues

There are some other comments about non-critical issues.
- When there is no analyzable column, vac_update_relstats is not called.
 Is this behavior intentional?
- psql can't complete foreign table name after ANALYZE.
- A new parameter has been added to vac_update_relstats in a recent
commit.  Perhaps 0 is OK for that parameter.
- ANALYZE without relation name ignores foreign tables because
get_rel_oids doesn't list foreign tables.
- IMO logging analyzing foo.bar should not be done in
AnalyzeForeignTable handler of each FDW because some FDW might forget to
do it.  Maybe it should be pulled up to analyze_rel or somewhere in core.
- It should be mentioned in a document that foreign tables are not
analyzed automatically because they are read-only.

Regards,
-- 
Shigeru Hanada

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Collecting statistics on CSV file data

2011-10-17 Thread Etsuro Fujita
(2011/10/18 2:27), Shigeru Hanada wrote:
 The new patch could be applied with some shifts.  Regression tests of
 core and file_fdw have passed cleanly.  Though I've tested only simple
 tests, ANALYZE works for foreign tables for file_fdw, and estimation of
 costs and selectivity seem appropriate.

Thank you for your testing.

 New API AnalyzeForeignTable
 ===

 And I think that AnalyzeForeignTable should be optional, and it would be
 very useful if a default handler is provided.  Probably a default
 handler can use basic FDW APIs to acquire sample rows from the result of
 SELECT * FROM foreign_table with skipping periodically.  It won't be
 efficient but I think it's not so unreasonable.

I agree with you. However, I think that it is difficult to support such
a default handler in a unified way because there exist external data
sources for which we cannot execute SELECT * FROM foreign_table, e.g.,
web-accessible DBs limiting full access to the contents.

 Other issues
 
 There are some other comments about non-critical issues.
 - When there is no analyzable column, vac_update_relstats is not called.
   Is this behavior intentional?
 - psql can't complete foreign table name after ANALYZE.
 - A new parameter has been added to vac_update_relstats in a recent
 commit.  Perhaps 0 is OK for that parameter.

I'll check.

 - ANALYZE without relation name ignores foreign tables because
 get_rel_oids doesn't list foreign tables.

I think that it might be better to ignore foreign tables by default
because analyzing such tables may take long depending on FDW.

 - IMO logging analyzing foo.bar should not be done in
 AnalyzeForeignTable handler of each FDW because some FDW might forget to
 do it.  Maybe it should be pulled up to analyze_rel or somewhere in core.
 - It should be mentioned in a document that foreign tables are not
 analyzed automatically because they are read-only.

OK. I'll revise.

 Regards,

Best regards,
Etsuro Fujita

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Collecting statistics on CSV file data

2011-10-10 Thread Etsuro Fujita

(2011/10/07 21:56), David Fetter wrote:

(But this is BTW. I'm interested in developing CREATE FOREIGN INDEX.
I've examined whether there are discussions about the design and
implementation of it in the archive, but could not find information.
If you know anything, please tell me.)


Look into the virtual index interface from Informix.


Thank you for the information.


We might want to start a wiki page on this.


Yeah, I think it might be better to add information to the SQL/MED wiki 
page:


http://wiki.postgresql.org/wiki/SQL/MED

Best regards,
Etsuro Fujita

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Collecting statistics on CSV file data

2011-10-07 Thread Etsuro Fujita
Hi Hanada-san,

I'm very sorry for late reply.

(2011/09/20 18:49), Shigeru Hanada wrote:
 I took a look at the patch, and found that it couldn't be applied
 cleanly against HEAD.  Please rebase your patch against current HEAD of
 master branch, rather than 9.1beta1.
 
 The wiki pages below would be helpful for you.
http://wiki.postgresql.org/wiki/Submitting_a_Patch
http://wiki.postgresql.org/wiki/Creating_Clean_Patches
http://wiki.postgresql.org/wiki/Reviewing_a_Patch
 
 And it would be easy to use git to follow changes made by other
 developers in master branch.
 http://wiki.postgresql.org/wiki/Working_with_Git

Thank you for the review and the helpful information.
I rebased. Please find attached a patch. I'll add the patch to the next CF.

Changes:

  * cleanups and fixes
  * addition of the following to ALTER FOREIGN TABLE
  ALTER [COLUMN] column SET STATISTICS integer
  ALTER [COLUMN] column SET ( n_distinct = val ) (n_distinct only)
  ALTER [COLUMN] column RESET ( n_distinct )
  * reflection of the force_not_null info in acquiring sample rows
  * documentation

Best regards,
Etsuro Fujita
*** a/contrib/file_fdw/file_fdw.c
--- b/contrib/file_fdw/file_fdw.c
***
*** 15,30 
--- 15,42 
  #include sys/stat.h
  #include unistd.h
  
+ #include access/htup.h
  #include access/reloptions.h
+ #include access/transam.h
  #include catalog/pg_foreign_table.h
  #include commands/copy.h
+ #include commands/dbcommands.h
  #include commands/defrem.h
  #include commands/explain.h
+ #include commands/vacuum.h
  #include foreign/fdwapi.h
  #include foreign/foreign.h
  #include miscadmin.h
  #include nodes/makefuncs.h
  #include optimizer/cost.h
+ #include optimizer/plancat.h
+ #include parser/parse_relation.h
+ #include pgstat.h
+ #include utils/attoptcache.h
+ #include utils/elog.h
+ #include utils/guc.h
+ #include utils/lsyscache.h
+ #include utils/memutils.h
  #include utils/rel.h
  #include utils/syscache.h
  
***
*** 101,106  static void fileBeginForeignScan(ForeignScanState *node, int 
eflags);
--- 113,119 
  static TupleTableSlot *fileIterateForeignScan(ForeignScanState *node);
  static void fileReScanForeignScan(ForeignScanState *node);
  static void fileEndForeignScan(ForeignScanState *node);
+ static void fileAnalyzeForeignTable(Relation onerel, VacuumStmt *vacstmt, int 
elevel);
  
  /*
   * Helper functions
***
*** 112,118  static List *get_file_fdw_attribute_options(Oid relid);
  static void estimate_costs(PlannerInfo *root, RelOptInfo *baserel,
   const char *filename,
   Cost *startup_cost, Cost *total_cost);
! 
  
  /*
   * Foreign-data wrapper handler function: return a struct with pointers
--- 125,132 
  static void estimate_costs(PlannerInfo *root, RelOptInfo *baserel,
   const char *filename,
   Cost *startup_cost, Cost *total_cost);
! static void file_fdw_do_analyze_rel(Relation onerel, VacuumStmt *vacstmt, int 
elevel);
! static int  file_fdw_acquire_sample_rows(Relation onerel, int elevel, 
HeapTuple *rows, int targrows, BlockNumber *totalpages, double *totalrows);
  
  /*
   * Foreign-data wrapper handler function: return a struct with pointers
***
*** 129,134  file_fdw_handler(PG_FUNCTION_ARGS)
--- 143,149 
fdwroutine-IterateForeignScan = fileIterateForeignScan;
fdwroutine-ReScanForeignScan = fileReScanForeignScan;
fdwroutine-EndForeignScan = fileEndForeignScan;
+   fdwroutine-AnalyzeForeignTable = fileAnalyzeForeignTable;
  
PG_RETURN_POINTER(fdwroutine);
  }
***
*** 575,580  fileReScanForeignScan(ForeignScanState *node)
--- 590,605 
  }
  
  /*
+  * fileAnalyzeForeignTable
+  *Analyze table
+  */
+ static void
+ fileAnalyzeForeignTable(Relation onerel, VacuumStmt *vacstmt, int elevel)
+ {
+   file_fdw_do_analyze_rel(onerel, vacstmt, elevel);
+ }
+ 
+ /*
   * Estimate costs of scanning a foreign table.
   */
  static void
***
*** 584,590  estimate_costs(PlannerInfo *root, RelOptInfo *baserel,
  {
struct stat stat_buf;
BlockNumber pages;
!   int tuple_width;
double  ntuples;
double  nrows;
Costrun_cost = 0;
--- 609,616 
  {
struct stat stat_buf;
BlockNumber pages;
!   BlockNumber relpages;
!   double  reltuples;
double  ntuples;
double  nrows;
Costrun_cost = 0;
***
*** 604,619  estimate_costs(PlannerInfo *root, RelOptInfo *baserel,
if (pages  1)
pages = 1;
  
!   /*
!* Estimate the number of tuples in the file.  We back into this 
estimate
!* using the planner's idea of the relation width; which is bogus if not
!* all columns are being read, not to mention that 

Re: [HACKERS] WIP: Collecting statistics on CSV file data

2011-10-07 Thread Etsuro Fujita

Hi,

I'm very sorry for the late reply.

(2011/09/21 10:00), Alvaro Herrera wrote:

Excerpts from David Fetter's message of mar sep 20 21:22:32 -0300 2011:

On Tue, Sep 20, 2011 at 11:13:05AM -0400, Tom Lane wrote:



Probably a more interesting question is why we wouldn't change
autovacuum so that it calls this automatically for foreign tables.


How about a per-table setting that tells autovacuum whether to do
this?


Seems reasonable.  Have autovacuum assume that foreign tables are not to
be analyzed, unless some reloption is set.


Thank you for the comments. I'd like to leave that feature for future work.

(But this is BTW. I'm interested in developing CREATE FOREIGN INDEX. 
I've examined whether there are discussions about the design and 
implementation of it in the archive, but could not find information. If 
you know anything, please tell me.)


Best regards,
Etsuro Fujita

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Collecting statistics on CSV file data

2011-10-07 Thread David Fetter
On Fri, Oct 07, 2011 at 08:09:44PM +0900, Etsuro Fujita wrote:
 Hi,
 
 I'm very sorry for the late reply.
 
 (2011/09/21 10:00), Alvaro Herrera wrote:
 Excerpts from David Fetter's message of mar sep 20 21:22:32 -0300 2011:
 On Tue, Sep 20, 2011 at 11:13:05AM -0400, Tom Lane wrote:
 
 Probably a more interesting question is why we wouldn't change
 autovacuum so that it calls this automatically for foreign tables.
 
 How about a per-table setting that tells autovacuum whether to do
 this?
 
 Seems reasonable.  Have autovacuum assume that foreign tables are not to
 be analyzed, unless some reloption is set.
 
 Thank you for the comments. I'd like to leave that feature for future work.

OK

 (But this is BTW. I'm interested in developing CREATE FOREIGN INDEX.
 I've examined whether there are discussions about the design and
 implementation of it in the archive, but could not find information.
 If you know anything, please tell me.)

Look into the virtual index interface from Informix.  We might want
to start a wiki page on this.

Cheers,
David.
-- 
David Fetter da...@fetter.org http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Collecting statistics on CSV file data

2011-09-20 Thread Shigeru Hanada
Hi Fujita-san,

(2011/09/12 19:40), Etsuro Fujita wrote:
 Hi there,
 
 To enable file_fdw to estimate costs of scanning a CSV file more
 accurately, I would like to propose a new FDW callback routine,
 AnalyzeForeignTable, which allows to ANALYZE command to collect
 statistics on a foreign table, and a corresponding file_fdw function,
 fileAnalyzeForeignTable. Attached is my WIP patch.
snip

I think this is a very nice feature so that planner would be able to
create smarter plan for a query which uses foreign tables.

I took a look at the patch, and found that it couldn't be applied
cleanly against HEAD.  Please rebase your patch against current HEAD of
master branch, rather than 9.1beta1.

The wiki pages below would be helpful for you.
  http://wiki.postgresql.org/wiki/Submitting_a_Patch
  http://wiki.postgresql.org/wiki/Creating_Clean_Patches
  http://wiki.postgresql.org/wiki/Reviewing_a_Patch

And it would be easy to use git to follow changes made by other
developers in master branch.
   http://wiki.postgresql.org/wiki/Working_with_Git

Regards,
-- 
Shigeru Hanada

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Collecting statistics on CSV file data

2011-09-20 Thread Marti Raudsepp
2011/9/12 Etsuro Fujita fujita.ets...@lab.ntt.co.jp:
 This is called when ANALYZE command is executed. (ANALYZE
 command should be executed because autovacuum does not analyze foreign
 tables.)

This is a good idea.

However, if adding these statistics requires an explicit ANALYZE
command, then we should also have a command for resetting the
collected statistics -- to get it back into the un-analyzed state.

Currently it looks like the only way to reset statistics is to tamper
with catalogs directly, or recreate the foreign table.

Regards,
Marti

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Collecting statistics on CSV file data

2011-09-20 Thread Tom Lane
Marti Raudsepp ma...@juffo.org writes:
 2011/9/12 Etsuro Fujita fujita.ets...@lab.ntt.co.jp:
 This is called when ANALYZE command is executed. (ANALYZE
 command should be executed because autovacuum does not analyze foreign
 tables.)

 This is a good idea.

 However, if adding these statistics requires an explicit ANALYZE
 command, then we should also have a command for resetting the
 collected statistics -- to get it back into the un-analyzed state.

Uh, why?  There is no UNANALYZE operation for ordinary tables, and
I've never heard anyone ask for one.

If you're desperate you could manually delete the relevant rows in
pg_statistic, a solution that would presumably work for foreign tables
too.

Probably a more interesting question is why we wouldn't change
autovacuum so that it calls this automatically for foreign tables.

(Note: I'm unconvinced that there's a use-case for this in the case of
real foreign tables on a remote server --- it seems likely that the
wrapper ought to ask the remote server for its current stats, instead.
But it's clearly useful for non-server-backed sources such as file_fdw.)

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Collecting statistics on CSV file data

2011-09-20 Thread Euler Taveira de Oliveira

On 20-09-2011 11:12, Marti Raudsepp wrote:

2011/9/12 Etsuro Fujitafujita.ets...@lab.ntt.co.jp:

This is called when ANALYZE command is executed. (ANALYZE
command should be executed because autovacuum does not analyze foreign
tables.)


This is a good idea.

However, if adding these statistics requires an explicit ANALYZE
command, then we should also have a command for resetting the
collected statistics -- to get it back into the un-analyzed state.

Why would you want this? If the stats aren't up to date, run ANALYZE 
periodically. Remember that it is part of the DBA maintenance tasks [1].



[1] http://www.postgresql.org/docs/current/static/maintenance.html


--
   Euler Taveira de Oliveira - Timbira   http://www.timbira.com.br/
   PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Collecting statistics on CSV file data

2011-09-20 Thread David Fetter
On Tue, Sep 20, 2011 at 11:13:05AM -0400, Tom Lane wrote:
 Marti Raudsepp ma...@juffo.org writes:
  2011/9/12 Etsuro Fujita fujita.ets...@lab.ntt.co.jp:
  This is called when ANALYZE command is executed. (ANALYZE
  command should be executed because autovacuum does not analyze foreign
  tables.)
 
  This is a good idea.
 
  However, if adding these statistics requires an explicit ANALYZE
  command, then we should also have a command for resetting the
  collected statistics -- to get it back into the un-analyzed state.
 
 Uh, why?  There is no UNANALYZE operation for ordinary tables, and
 I've never heard anyone ask for one.
 
 If you're desperate you could manually delete the relevant rows in
 pg_statistic, a solution that would presumably work for foreign tables
 too.
 
 Probably a more interesting question is why we wouldn't change
 autovacuum so that it calls this automatically for foreign tables.

How about a per-table setting that tells autovacuum whether to do
this?  Come to think of it, all of per-FDW, per-remote and per-table
settings would be handy, so people could express things like, all CSV
files except these three, all PostgreSQL connections on the
10.1.0.0/16 network, and these two tables in Oracle.

Cheers,
David.
-- 
David Fetter da...@fetter.org http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Collecting statistics on CSV file data

2011-09-20 Thread Alvaro Herrera

Excerpts from David Fetter's message of mar sep 20 21:22:32 -0300 2011:
 On Tue, Sep 20, 2011 at 11:13:05AM -0400, Tom Lane wrote:

  Probably a more interesting question is why we wouldn't change
  autovacuum so that it calls this automatically for foreign tables.
 
 How about a per-table setting that tells autovacuum whether to do
 this?

Seems reasonable.  Have autovacuum assume that foreign tables are not to
be analyzed, unless some reloption is set.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] WIP: Collecting statistics on CSV file data

2011-09-12 Thread Etsuro Fujita
Hi there,

To enable file_fdw to estimate costs of scanning a CSV file more
accurately, I would like to propose a new FDW callback routine,
AnalyzeForeignTable, which allows to ANALYZE command to collect
statistics on a foreign table, and a corresponding file_fdw function,
fileAnalyzeForeignTable. Attached is my WIP patch.

Here's a summary of the implementation:

void AnalyzeForeignTable (Relation relation, VacuumStmt *vacstmt, int
elevel);
This is a new FDW callback routine to collect statistics on a foreign
table and store the results in the pg_class and pg_statistic system
catalogs. This is called when ANALYZE command is executed. (ANALYZE
command should be executed because autovacuum does not analyze foreign
tables.)

static void fileAnalyzeForeignTable(Relation relation, VacuumStmt
*vacstmt, int elevel);
This new file_fdw function collects and stores the same statistics on
CSV file data as collected on a local table except for index related
statistics by executing the sequential scan on the CSV file and
acquiring sample rows using Vitter's algorithm. (It is time-consuming
for a large file.)

estimate_costs() (more precisely, clauselist_selectivity() in
estimate_costs()) estimates baserel-rows using the statistics stored in
the pg_statistic system catalog. If there are no statistics,
estimate_costs() estimates it using the default statistics as in
PostgreSQL 9.1.

I am able to demonstrate the effectiveness of this patch. The following
run is performed on a single core of a 3.00GHz Intel Xeon CPU with 8GB
of RAM. Configuration settings are default except for work_mem = 256MB.
We can see from this result that the optimiser selects a good plan when
the foreign tables have been analyzed.

I appreciate your comments and suggestions.

[sample csv file data]
postgres=# COPY (SELECT s.a, repeat('a', 100) FROM generate_series(1,
500) AS s(a)) TO '/home/pgsql/sample_csv_data1.csv' (FORMAT csv,
DELIMITER ',');
COPY 500
postgres=# COPY (SELECT (random()*1)::int, repeat('b', 100) FROM
generate_series(1, 500)) TO '/home/pgsql/sample_csv_data2.csv'
(FORMAT csv, DELIMITER ',');
COPY 500

[Unpatched]
postgres=# CREATE FOREIGN TABLE tab1 (aid INTEGER, msg text) SERVER
file_fs OPTIONS (filename '/home/pgsql/sample_csv_data1.csv', format
'csv', delimiter ',');
CREATE FOREIGN TABLE
postgres=# CREATE FOREIGN TABLE tab2 (aid INTEGER, msg text) SERVER
file_fs OPTIONS (filename '/home/pgsql/sample_csv_data2.csv', format
'csv', delimiter ',');
CREATE FOREIGN TABLE
postgres=# SELECT count(*) FROM tab1;
  count
-
 500
(1 row)

postgres=# SELECT count(*) FROM tab2;
  count
-
 500
(1 row)

postgres=# EXPLAIN ANALYZE SELECT count(*) FROM tab1, tab2 WHERE
tab1.aid = 0 AND tab1.aid = 1 AND tab1.aid = tab2.aid;
   QUERY
PLAN

-

---
 Aggregate  (cost=128859182.29..128859182.30 rows=1 width=0) (actual
time=27321.304..27321.304 rows=1 loops=1)
   -  Merge Join  (cost=5787102.68..111283426.33 rows=7030302383
width=0) (actual time=22181.428..26736.194 rows=4999745 loops=1)
 Merge Cond: (tab1.aid = tab2.aid)
 -  Sort  (cost=1857986.37..1858198.83 rows=84983 width=4)
(actual time=5964.282..5965.958 rows=1 loops=1)
   Sort Key: tab1.aid
   Sort Method: quicksort  Memory: 853kB
   -  Foreign Scan on tab1  (cost=0.00..1851028.44
rows=84983 width=4) (actual time=0.071..5962.382 rows=1 loops=1)
 Filter: ((aid = 0) AND (aid = 1))
 Foreign File: /home/pgsql/sample_csv_data1.csv
 Foreign File Size: 54396
 -  Materialize  (cost=3929116.30..4011842.29 rows=16545197
width=4) (actual time=16216.953..19550.846 rows=500 loops=1)
   -  Sort  (cost=3929116.30..3970479.30 rows=16545197
width=4) (actual time=16216.947..18418.684 rows=500 loops=1)
 Sort Key: tab2.aid
 Sort Method: external merge  Disk: 68424kB
 -  Foreign Scan on tab2  (cost=0.00..1719149.70
rows=16545197 width=4) (actual time=0.081..6059.630 rows=500 loops=1)
   Foreign File: /home/pgsql/sample_csv_data2.csv
   Foreign File Size: 529446313
 Total runtime: 27350.673 ms
(18 rows)

[Patched]
postgres=# CREATE FOREIGN TABLE tab1 (aid INTEGER, msg text) SERVER
file_fs OPTIONS (filename '/home/pgsql/sample_csv_data1.csv', format
'csv', delimiter ',');
CREATE FOREIGN TABLE
postgres=# CREATE FOREIGN TABLE tab2 (aid INTEGER, msg text) SERVER
file_fs OPTIONS (filename '/home/pgsql/sample_csv_data2.csv', format
'csv', delimiter ',');
CREATE FOREIGN TABLE
postgres=# ANALYZE VERBOSE tab1;
INFO:  analyzing public.tab1
INFO:  tab1: scanned, containing 500 rows; 3 rows in sample
ANALYZE