Re: [HACKERS] Cluster name in ps output

2014-07-03 Thread Fujii Masao
On Wed, Jul 2, 2014 at 3:45 AM, Vik Fearing  wrote:
> On 06/29/2014 02:25 PM, Andres Freund wrote:
>> On 2014-06-29 11:11:14 +0100, Thomas Munro wrote:
>>> > On 29 June 2014 10:55, Andres Freund  wrote:
 > > So, I'd looked at it with an eye towards committing it and found some
 > > more things. I've now
 > > * added the restriction that the cluster name can only be ASCII. It's
 > >   shown from several clusters with differing encodings, and it's shown
 > >   in process titles, so that seems wise.
 > > * moved everything to the LOGGING_WHAT category
 > > * added minimal documention to monitoring.sgml
 > > * expanded the config.sgml entry to mention the restriction on the 
 > > name.
 > > * Changed the GUCs short description
>>> >
>>> > Thank you.
>>> >
 > > I also think, but haven't done so, we should add a double colon after
 > > the cluster name, so it's not:
 > >
 > > postgres: server1 stats collector process
 > > but
 > > postgres: server1: stats collector process
>>> >
>>> > +1
>>
>> Committed with the above changes. Thanks for the contribution!
>
> Is there a reason for not using this in synchronous_standby_names,
> perhaps falling back to application_name if not set?

Do you mean that if synchronous_standby_names is empty it should automatically
be set to cluster_name? Or that if application_name is not set in
primary_conninfo the standby should automatically use its cluster_name as the
application_name in primary_conninfo? I'm afraid those may cause trouble, such
as a standby unexpectedly being treated as a synchronous one even though the
user wanted to set it up as asynchronous by emptying
synchronous_standby_names on the master.
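
To illustrate the concern, here is a minimal sketch (ALTER SYSTEM is 9.4
syntax; the names are placeholders, and the fallback described in the comments
is the hypothetical proposal, not current behaviour):

-- On the master, the DBA empties the setting, intending that every
-- standby replicate asynchronously.
ALTER SYSTEM SET synchronous_standby_names = '';
SELECT pg_reload_conf();
-- If an empty value implicitly fell back to cluster_name (say 'server1'),
-- a standby connecting with application_name = 'server1' would silently
-- be treated as synchronous, which is exactly the surprise above.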

Regards,

-- 
Fujii Masao




Re: [HACKERS] pg_xlogdump --stats

2014-07-03 Thread Abhijit Menon-Sen
At 2014-06-30 05:19:10 +, dilip.ku...@huawei.com wrote:
>
> Please fix these issues and send the updated patch..
> 
> I will continue reviewing the patch..

Did you get anywhere with the updated patch?

-- Abhijit




Re: [HACKERS] Can simplify 'limit 1' with slow function?

2014-07-03 Thread Martijn van Oosterhout
Fascinating.

On Fri, Jul 04, 2014 at 10:47:06AM +0800, gotoschool6g wrote:
> slow query(8531 ms):
> SELECT ST_Distance_Sphere(shape,ST_GeomFromText('POINT(116.41386186784513 
> 40.12211338311868)')) FROM road order by id LIMIT 1;
> 
> explain output:
> "Limit  (cost=4653.48..4653.48 rows=1 width=3612)"
> "  ->  Sort  (cost=4653.48..4683.06 rows=11832 width=3612)"
> "Sort Key: id"
> "->  Seq Scan on road  (cost=0.00..4594.32 rows=11832 width=3612)"
> 
> fast query(16ms):
> select ST_Distance_Sphere(shape,ST_GeomFromText('POINT(116.41386186784513 
> 40.12211338311868)')) from (SELECT shape FROM road order by id  LIMIT 1) a
> 
> explain output:
> "Subquery Scan on a  (cost=1695.48..1695.74 rows=1 width=3608)"
> "  ->  Limit  (cost=1695.48..1695.48 rows=1 width=3612)"
> "->  Sort  (cost=1695.48..1725.06 rows=11832 width=3612)"
> "  Sort Key: road.id"
> "  ->  Seq Scan on road  (cost=0.00..1636.32 rows=11832 
> width=3612)"

So Postgres knows perfectly well that it's expensive, it just doesn't
appear to understand it has the option of moving the calculation above
the limit.

In this case though, it seems an index on road(id) would make it
instant in any case.
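
A minimal sketch of that suggestion, using the table and column from the
quoted EXPLAIN output (the index name is arbitrary):

-- With this index the planner can return the first row in id order straight
-- from the index, so the expensive function is evaluated for that row only.
CREATE INDEX road_id_idx ON road (id);

-- Re-running the original query should now show a Limit over an Index Scan
-- instead of a Sort over a Seq Scan.
EXPLAIN SELECT ST_Distance_Sphere(shape,
    ST_GeomFromText('POINT(116.41386186784513 40.12211338311868)'))
FROM road ORDER BY id LIMIT 1;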

Have a nice day,
-- 
Martijn van Oosterhout  http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts.
   -- Arthur Schopenhauer




Re: [HACKERS] Issue while calling new PostgreSQL command from a Java Application

2014-07-03 Thread Ashutosh Bapat
You may have to add code to copy inp_str in _copyVacuumStmt(). See how
character arrays are copied in the other _copy* functions.


On Fri, Jul 4, 2014 at 10:43 AM, Ashoke  wrote:

> Hi,
>
> --
>
> I have defined a new command my_command in PostgreSQL. This command takes
> the path of ANALYZE and inside analyze.c, I have a function to do some
> operations if its my_command.This command takes the input arguments:
> table name, column name and an input string.
>
> my_command nation (n_nationkey) 'input string';
>
> When I run this command from command line psql, it works as expected. But
> when I call the same command from a java application, the variable that
> stores the input string is NULL.
>
> I printed the value of the input string in gram.y file where I have
> defined my_command.
> fprintf (stderr, "I am inside gram.y %s\n",n->inp_str); and the input
> string is printed correctly.
>
> But when I print stmt->inp_str in the function standard_ProcessUtility()
>  of utility.c for the case T_VacuumStmt, I get the value as NULL. This is
> as far as I could trace back from analyze.c.
>
> I am not sure how executing the same command from an application can make
> a difference.
>
> gram.y content gist:
> --
>
> MyStmt:
> my_keyword qualified_name name_list my_inp_str
> {
> VacuumStmt *n = makeNode(VacuumStmt);
> n->options = VACOPT_ANALYZE;
> n->freeze_min_age = -1;
> n->freeze_table_age = -1;
> n->relation = $2;
> n->va_cols = $3;
> n->inp_str = $4;
> fprintf (stderr, "I am inside gram.y %s\n",n->inp_str);
>
> $$ = (Node *)n;
> };
>
> char *inp_str is added to the struct VacuumStmt in parsenodes.h
>
> ---
>
> Only the newly added char *inp_str(that is different from ANALYZE) value
> is NULL. I was able to retrieve the column name from va_cols.
>
> Any help is appreciated. Thanks!
> --
> Regards,
> Ashoke
>
>
>


-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] Issue while calling new PostgreSQL command from a Java Application

2014-07-03 Thread Abhijit Menon-Sen
At 2014-07-04 10:43:12 +0530, s.ash...@gmail.com wrote:
>
> I am not sure how executing the same command from an application can
> make a difference.

It shouldn't make any difference, of course.

But since you're seeing the problem with new code, the overwhelming
probability is that there's an error in the new code. That being the
case, speculating about what might be going wrong without looking at
the code in question is a waste of time.

-- Abhijit




Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-07-03 Thread Abhijit Menon-Sen
At 2014-07-04 14:38:27 +0900, masao.fu...@gmail.com wrote:
>
> But 0002-CompressBackupBlock_snappy_lz4_pglz-2.patch doesn't seem to
> be able to apply to HEAD cleanly.

Yes, and it needs quite some reformatting beyond fixing whitespace
damage too (long lines, comment formatting, consistent spacing etc.).

-- Abhijit




Re: [HACKERS] RLS Design

2014-07-03 Thread Kouhei Kaigai
Sorry for my late response; I'm now catching up on the discussion.

> * Robert Haas (robertmh...@gmail.com) wrote:
> > On Tue, Jul 1, 2014 at 3:20 PM, Dean Rasheed 
> wrote:
> > > If RLS quals are instead regarded as constraints on access, and
> > > multiple policies apply, then it seems that the quals should now be
> > > combined with AND rather than OR, right?
> 
> I do feel that RLS quals are constraints on access, but I don't see how
> it follows that multiple quals should be AND'd together because of that.
> I view the RLS policies on each table as being independent and "standing
> alone" regarding what can be seen.  If you have access to a table today
> through policy A, and then later policy B is added, using AND would mean
> that the set of rows returned is less than if only policy A existed.
> That doesn't seem correct to me.
> 
It seems to me that the direction in which the constraints (RLS policies) work is the reverse.

When we have no RLS policy, 100% of the rows are visible, aren't they?
Adding a constraint usually reduces the number of rows being visible,
or at worst leaves it unchanged. A constraint should never work in the
direction of increasing the number of visible rows.

If multiple RLS policies are connected with the OR operator, the first policy
works in the direction of reducing the number of visible rows, but the second
policy works in the reverse direction.

If we had OR'd RLS policies, how would they be merged with the user-given
qualifiers?
For example, if the RLS policy of t1 is (t1.credential < get_user_credential)
and the user's query is:
  SELECT * FROM t1 WHERE t1.x = t1.x;
do you think the RLS policy should be merged in OR'd form?
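
For illustration, here is how the single policy from the example would be
merged with the user's qual (names are taken from the example above;
policy_b_qual is a hypothetical second policy, and this is hand-written
rewriting, not planner output):

-- Effective query after applying the single policy:
SELECT * FROM t1
WHERE (t1.credential < get_user_credential)   -- RLS policy qual
  AND (t1.x = t1.x);                          -- user-supplied qual

-- With a second policy, the open question is whether the policy part becomes
--   (t1.credential < get_user_credential) AND (policy_b_qual)
-- or
--   (t1.credential < get_user_credential) OR  (policy_b_qual)
-- before being ANDed with the user's qual.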


> > Yeah, maybe.  I intuitively feel that OR would be more useful, so it
> > would be nice to find a design where that makes sense.  But it depends
> > a lot, in my view, on what syntax we end up with.  For example,
> > suppose we add just one command:
> >
> > ALTER TABLE table_name FILTER [ role_name | PUBLIC ] USING qual;
> >
> > If the given role inherits from multiple roles that have different
> > filters, I think the user will naturally expect all of the filters to
> > be applied.
> 
> Agreed.
> 
> > But you could do it other ways.  For example:
> >
> > ALTER TABLE table_name [ NO ] ROW LEVEL SECURITY; ALTER TABLE
> > table_name GRANT ROW ACCESS TO role_name USING qual;
> >
> > If a table is set to NO ROW LEVEL SECURITY then it behaves just like
> > it does now: anyone who accesses it sees all the rows, restricted to
> > those columns for which they have permission.  If the table is set to
> > ROW LEVEL SECURITY then the default is to show no rows.  The second
> > command then allows access to a subset of the rows for a give role
> > name.  In this case, it is probably logical for access to be combined
> > via OR.
> 
> I can see value is having a table-level option to indicate if RLS is applied
> for that table or not, but I had been thinking we'd just automatically manage
> that.  That is to say that once you define an RLS policy for a table, we
> go look and see what policy should be applied in each case.  With the user
> able to control that, what happens if they say "row security" on the table
> and there are no policies?  All access would show the table as empty?  What
> if policies exist and they decide to 'turn off' RLS for the table- suddenly
> everyone can see all the rows?
> 
> My answers to the above (which are making me like the idea more,
> actually...) would be:
> 
> Yes, if they turn on RLS for the table and there aren't any policies, then
> the table appears empty for anyone with normal SELECT rights (table owner
> and superusers would still see everything).
> 
> If policies exist and the user asks to turn off RLS, I'd throw an ERROR
> as there is a security risk there.  We could support a CASCADE option which
> would go and drop the policies from the table first.
> 
Hmm... This approach starts from empty permissions and then adds permission
to reference a particular range of the configured table. That is one possible attitude.

However, I think it has a dark side we cannot ignore. Usually, the purpose
of a security mechanism is to guarantee what is readable/writable according to
the rules. Once multiple RLS policies are merged in OR'd form, the results
become unpredictable.
Please assume there are two independent applications that use RLS on table X.
Even if application-1 wants only rows marked "public" to become visible, it may
expose "credential" or "secret" rows through the interaction with an orthogonal
policy configured by application-2 (which might configure its policy according
to the source IP address). It seems to me that application-2 has partially
invalidated the RLS policy configured by application-1.

I think an important characteristic is that things meant to be invisible stay
invisible even when multiple rules are configured.
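
As a concrete sketch of that interaction, using the ALTER TABLE ... GRANT ROW
ACCESS syntax floated earlier in the thread (the syntax is only a proposal,
and the table, roles, and quals here are made up):

-- application-1: only rows labelled 'public' should ever be visible
ALTER TABLE table_x GRANT ROW ACCESS TO app1_role USING (label = 'public');

-- application-2: filters by client address, unaware of application-1's rule
ALTER TABLE table_x GRANT ROW ACCESS TO app2_role
  USING (inet_client_addr() << inet '10.0.0.0/8');

-- If both policies apply to one session and are combined with OR, a row
-- labelled 'secret' becomes visible whenever the address test passes,
-- silently defeating application-1's restriction.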

> Otherwise, I'm generally liking Dean's thoughts in
> http://www.postgresql.org/message-id/CAEZATCVftksFH=X+9mVmBNMZo5KsUP+R
> k0kb4oro92jofjo...@mail.gmail.com
> along with 

Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-07-03 Thread Fujii Masao
On Fri, Jul 4, 2014 at 4:58 AM, Rahila Syed  wrote:
> Hello,
>
> Updated version of patches are attached.
> Changes are as follows
> 1. Improved readability of the code as per the review comments.
> 2. Addition of block_compression field in BkpBlock structure to store
> information about compression of block. This provides for switching
> compression on/off and changing compression algorithm as required.
> 3.Handling of OOM in critical section by checking for return value of malloc
> and proceeding without compression of FPW if return value is NULL.

Thanks for updating the patches!

But 0002-CompressBackupBlock_snappy_lz4_pglz-2.patch doesn't seem to
be able to apply to HEAD cleanly.

---
$ git am ~/Desktop/0001-Support-for-LZ4-and-Snappy-2.patch
Applying: Support for LZ4 and Snappy-2

$ git am ~/Desktop/0002-CompressBackupBlock_snappy_lz4_pglz-2.patch
Applying: CompressBackupBlock_snappy_lz4_pglz-2
/home/postgres/pgsql/git/.git/rebase-apply/patch:42: indent with spaces.
/*Allocates memory for compressed backup blocks according to
the compression algorithm used.Once per session at the time of
insertion of first XLOG record.
/home/postgres/pgsql/git/.git/rebase-apply/patch:43: indent with spaces.
  This memory stays till the end of session. OOM is handled by
making the code proceed without FPW compression*/
/home/postgres/pgsql/git/.git/rebase-apply/patch:58: indent with spaces.
if(compressed_pages[j] == NULL)
/home/postgres/pgsql/git/.git/rebase-apply/patch:59: space before tab in indent.
  {
/home/postgres/pgsql/git/.git/rebase-apply/patch:60: space before tab in indent.
 compress_backup_block=BACKUP_BLOCK_COMPRESSION_OFF;
error: patch failed: src/backend/access/transam/xlog.c:60
error: src/backend/access/transam/xlog.c: patch does not apply
Patch failed at 0001 CompressBackupBlock_snappy_lz4_pglz-2
When you have resolved this problem run "git am --resolved".
If you would prefer to skip this patch, instead run "git am --skip".
To restore the original branch and stop patching run "git am --abort".
---

Regards,

-- 
Fujii Masao




[HACKERS] Issue while calling new PostgreSQL command from a Java Application

2014-07-03 Thread Ashoke
Hi,

--

I have defined a new command my_command in PostgreSQL. This command takes
the path of ANALYZE, and inside analyze.c I have a function to do some
operations if it's my_command. This command takes the input arguments: table
name, column name, and an input string.

my_command nation (n_nationkey) 'input string';

When I run this command from command line psql, it works as expected. But
when I call the same command from a java application, the variable that
stores the input string is NULL.

I printed the value of the input string in gram.y file where I have defined
my_command.
fprintf (stderr, "I am inside gram.y %s\n",n->inp_str); and the input
string is printed correctly.

But when I print stmt->inp_str in the function standard_ProcessUtility() of
utility.c for the case T_VacuumStmt, I get the value as NULL. This is as
far as I could trace back from analyze.c.

I am not sure how executing the same command from an application can make a
difference.

gram.y content gist:
--

MyStmt:
my_keyword qualified_name name_list my_inp_str
{
VacuumStmt *n = makeNode(VacuumStmt);
n->options = VACOPT_ANALYZE;
n->freeze_min_age = -1;
n->freeze_table_age = -1;
n->relation = $2;
n->va_cols = $3;
n->inp_str = $4;
fprintf (stderr, "I am inside gram.y %s\n",n->inp_str);

$$ = (Node *)n;
};

char *inp_str is added to the struct VacuumStmt in parsenodes.h

---

Only the value of the newly added char *inp_str (the part that differs from
ANALYZE) is NULL. I was able to retrieve the column name from va_cols.

Any help is appreciated. Thanks!
-- 
Regards,
Ashoke


Re: [HACKERS] "RETURNING PRIMARY KEY" syntax extension

2014-07-03 Thread Rushabh Lathia
On Fri, Jul 4, 2014 at 7:37 AM, Tom Lane  wrote:

> Tom Dunstan  writes:
> > On 4 July 2014 00:07, Tom Lane  wrote:
> >> It sounds to me like designing this for JDBC's getGeneratedKeys method
> >> is a mistake.  There isn't going to be any way that the driver can
> support
> >> that without having looked at the table's metadata for itself, and if
> >> it's going to do that then it doesn't need a crutch that only partially
> >> solves the problem anyhow.
>
> > Sure it can - it can append RETURNING PRIMARY KEY and hand back a ResultSet
> > from whatever was returned. It's CURRENTLY doing that, but it's appending
> > RETURNING * and leaving it up to the caller of getGeneratedKeys() to work
> > out which columns the caller is interested in.
>
> I might be mistaken, but as I read the spec for getGeneratedKeys, whether
> or not a column is in a primary key has got nothing to do with whether
> that column should be returned.  It looks to me like what you're supposed
> to return is columns with computed default values, eg, serial columns.
> (Whether we should define it as being *exactly* columns created by the
> SERIAL macro is an interesting question, but for sure those ought to be
> included.)  Now, the pkey might be a serial column ... but it equally
> well might not be; it might not have any default expression at all.
> And there certainly could be serial columns that weren't in the pkey.
>
>
100% agree with Tom.


> The fact that the spec is kinda fuzzy doesn't entitle us to ignore the
> plain meaning of the term "generated key".
>
> regards, tom lane
>



-- 
Rushabh Lathia


Re: [HACKERS] [v9.5] Custom Plan API

2014-07-03 Thread Kouhei Kaigai
Hanada-san,

Thanks for your dedicated review.

It's a very long message, so let me summarize the things
I shall do in the next patch.

* fix bug: custom-plan class comparison
* fix up naming convention and syntax
  CREATE CUSTOM PLAN PROVIDER, rather than
  CREATE CUSTOM PLAN. Prefix shall be "cpp_".
* fix up: definition of get_custom_plan_oid()
* fix up: unexpected white spaces, to be tabs.
* fix up: remove unnecessary forward declarations.
* fix up: revert replace_nestloop_params() to static
* make SetCustomPlanRef an optional callback
* fix up: typos in various points
* add documentation to explain custom-plan interface.

Also, I want a committer's opinion on the issues below:
* whether set_cheapest() should be called for all relkinds
* how the arguments of the add_path handler should be derived

Individual comments are put below:

> Kaigai-san,
> 
> Sorry for lagged response.
> 
> Here are my  random thoughts about the patch.  I couldn't understand the
> patch fully, because some of APIs are not used by ctidscan.  If
> 
> Custom Scan patch v2 review
> 
> * Custom plan class comparison
> In backend/optimizer/util/pathnode.c, custclass is compared by bit-and
> with 's'.  Do you plan to use custclass as bit field?  If so, values for
> custom plan class should not be a character.  Otherwise, custclass should
> be compared by == operator.
> 
Sorry, it is a bug left over from the previous design.
I had an idea of allowing a custom plan provider to support multiple kinds
of exec nodes, but I concluded it does not make much sense (we can simply
define multiple CPPs, one for each).

> * Purpose of GetSpecialCustomVar()
> The reason why FinalizeCustomPlan callback is necessary is not clear to
> me.
> Could you show a case that the API would be useful?
> 
This feature is needed to replace a built-in join with a custom scan; its
value may be less clear for plain scan workloads.

Let me explain why join replacement needs it. A join node has two input
slots (inner and outer), and the Var nodes in its expressions reference one
slot or the other according to their varno (INNER_VAR or OUTER_VAR).
When a CPP replaces a join, it has to generate an equivalent result,
but using two input streams may not be the best choice.
(Recall that when we construct a remote join with postgres_fdw, all the
materialization is done on the remote side, so we have just one input stream
from which to generate the local equivalent of the join.)
On the other hand, the EXPLAIN command has to understand which column is the
source of each Var node in the custom node's targetlist, even if the plan is
rewritten to use just one slot. For example, which label should be shown when
the 3rd item of the targetlist originally came from the 2nd item of the inner
slot, but all the materialized results are stored in the outer slot?
Only the CPP can track the relationship between the original Var and the new
one. This interface provides a way to resolve what a Var node actually references.

> * Purpose of FinalizeCustomPlan()
> The reason why FinalizeCustomPlan callback is necessary is not clear to
> me, because ctidscan just calls finalize_primnode() and
> bms_add_members() with given information.  Could you show a case that the
> API would be useful?
> 
The main purpose of this callback is to give the extension a chance to apply
finalize_primnode() if the custom node holds expression trees in its private
fields.
When a CPP picks up some of the clauses to run in its own way, they are
attached to neither plan->targetlist nor plan->qual; only the CPP knows
where they are attached. So these orphan expression nodes have to be
handled by the CPP.

> * Is it ok to call set_cheapest() for all relkind?
> Now set_cheapest() is called not for only relation and foreign table but
> also custom plan, and other relations such as subquery, function, and value.
> Calling call_custom_scan_provider() and set_cheapest() in the case of
> RTE_RELATION seems similar to the old construct, how do you think about
> this?
> 
I don't think we will actually be able to have useful custom scan logic
for these special relation forms; however, I also don't have a particular
reason why custom plans should not support them.
I'd like to hear a committer's opinion here.


> * Is it hard to get rid of CopyCustomPlan()?
> Copying ForeignScan node doesn't need per-FDW copy function by limiting
> fdw_private to have only copy-able objects.  Can't we use the same way for
> CustomPlan?  Letting authors call NodeSetTag or
> copyObject() sounds uncomfortable to me.
> 
> This would be able to apply to TextOutCustomPlan() and TextOutCustomPath()
> too.
> 
The FDW-like design was the original one, but the latest design was suggested
by Tom during the v9.4 development cycle, because some data types, such as
Bitmapset, are not supported by copyObject.

> * MultiExec support is appropriate for the first version?
> The cases need MultiExec seems little complex for the first version of custom
> scan.  What kind of plan do you image for this feature?
> 
It is definitely necessary to exchange multipl

Re: [HACKERS] Pg_upgrade and toast tables bug discovered

2014-07-03 Thread Bruce Momjian
On Thu, Jul  3, 2014 at 11:55:40PM +0200, Andres Freund wrote:
> I don't think it's just that simple unfortunately. If pg_class entries
> get created that didn't exist on the old server there's a chance for oid
> conflicts. Consider
> 
> SELECT binary_upgrade.set_next_heap_pg_class_oid('17094'::pg_catalog.oid);
> CREATE TABLE table_without_toast_in_old_server(...);
> -- heap oid 17094, toast oid 16384
> 
> SELECT binary_upgrade.set_next_heap_pg_class_oid('16384'::pg_catalog.oid);
> CREATE TABLE another_table(...);
> ERROR: could not create file ...: File exists
> 
> I think we even had reports of such a problem.

I had not considered this.

I don't remember ever seeing such a report.  We have had oid mismatch
reports, but we now know the cause of those.

> The most robust, but not trivial, approach seems to be to prevent toast
> table creation if there wasn't a set_next_toast_pg_class_oid(). Then,
> after all relations are created, iterate over all pg_class entries that
> possibly need toast tables and recheck if they now might need one.

Wow, that is going to be kind of odd in that there really isn't a good
way to create toast tables except perhaps adding a dummy TEXT column and
removing it.  There also isn't an easy way to skip creating a toast table
but still find out that one was needed.  I suppose we would have to
insert some dummy value in the toast pg_class column and come back later
to clean it up.

I am wondering what the probability is of having a table that didn't need
a toast table in the old cluster, needs one in the new cluster, and there
also being an OID collision.

I think the big question is whether we want to go down that route.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +




Re: [HACKERS] Pg_upgrade and toast tables bug discovered

2014-07-03 Thread Bruce Momjian
On Thu, Jul  3, 2014 at 05:09:41PM -0400, Tom Lane wrote:
> Bruce Momjian  writes:
> > I have had trouble getting a table schema that is causing problems, but
> > received a report via EDB support recently that had a simple schema
> > (anonymized):
> > ...
> > needs_toast_table() computes the length of this table as 2024 bytes in
> > 9.0, and 2064 bytes on 9.1, with the TOAST threshold being 2032 bytes. 
> 
> > My initial idea is to allow for toast tables in the new cluster that
> > aren't in the old cluster by skipping over the extra toast tables.  This
> > would only be for pre-9.1 old clusters.
> 
> TBH, it has never been more than the shakiest of assumptions that the new
> version could not create toast tables where the old one hadn't.  I think
> you should just allow for that case, independently of specific PG
> versions.  Fortunately it seems easy enough, since you certainly don't
> need to put any data into the new toast tables.

OK, patch attached.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +
diff --git a/contrib/pg_upgrade/info.c b/contrib/pg_upgrade/info.c
new file mode 100644
index 6205c74..17e4d5b
*** a/contrib/pg_upgrade/info.c
--- b/contrib/pg_upgrade/info.c
*** gen_db_file_maps(DbInfo *old_db, DbInfo
*** 38,58 
   int *nmaps, const char *old_pgdata, const char *new_pgdata)
  {
  	FileNameMap *maps;
! 	int			relnum;
  	int			num_maps = 0;
  
  	maps = (FileNameMap *) pg_malloc(sizeof(FileNameMap) *
  	 old_db->rel_arr.nrels);
  
! 	for (relnum = 0; relnum < Min(old_db->rel_arr.nrels, new_db->rel_arr.nrels);
! 		 relnum++)
  	{
! 		RelInfo*old_rel = &old_db->rel_arr.rels[relnum];
! 		RelInfo*new_rel = &new_db->rel_arr.rels[relnum];
  
  		if (old_rel->reloid != new_rel->reloid)
! 			pg_fatal("Mismatch of relation OID in database \"%s\": old OID %d, new OID %d\n",
! 	 old_db->db_name, old_rel->reloid, new_rel->reloid);
  
  		/*
  		 * TOAST table names initially match the heap pg_class oid. In
--- 38,85 
   int *nmaps, const char *old_pgdata, const char *new_pgdata)
  {
  	FileNameMap *maps;
! 	int			old_relnum, new_relnum;
  	int			num_maps = 0;
  
  	maps = (FileNameMap *) pg_malloc(sizeof(FileNameMap) *
  	 old_db->rel_arr.nrels);
  
! 	/*
! 	 * The old database shouldn't have more relations than the new one.
! 	 * We force the new cluster to have a TOAST table if the old table
! 	 * had one.
! 	 */
! 	if (old_db->rel_arr.nrels > new_db->rel_arr.nrels)
! 		pg_fatal("old and new databases \"%s\" have a mismatched number of relations\n",
!  old_db->db_name);
! 
! 	/* Drive the loop using new_relnum, which might be higher. */
! 	for (old_relnum = new_relnum = 0; new_relnum < new_db->rel_arr.nrels;
! 		 new_relnum++)
  	{
! 		RelInfo*old_rel = &old_db->rel_arr.rels[old_relnum];
! 		RelInfo*new_rel = &new_db->rel_arr.rels[new_relnum];
! 
  
  		if (old_rel->reloid != new_rel->reloid)
! 		{
! 			if (strcmp(new_rel->nspname, "pg_toast") == 0)
! /*
!  * It is possible that the new cluster has a TOAST table
!  * for a table that didn't need one in the old cluster,
!  * e.g. 9.0 to 9.1 changed the NUMERIC length computation.
!  * Therefore, if we have a TOAST table in the new cluster
!  * that doesn't match, skip over it and continue processing.
!  * It is possible this TOAST table used an OID that was
!  * reserved in the old cluster, but we have no way of
!  * testing that, and we would have already gotten an error
!  * at the new cluster schema creation stage.
!  */
! continue;
! 			else
! pg_fatal("Mismatch of relation OID in database \"%s\": old OID %d, new OID %d\n",
! 		 old_db->db_name, old_rel->reloid, new_rel->reloid);
! 		}
  
  		/*
  		 * TOAST table names initially match the heap pg_class oid. In
*** gen_db_file_maps(DbInfo *old_db, DbInfo
*** 76,89 
  		create_rel_filename_map(old_pgdata, new_pgdata, old_db, new_db,
  old_rel, new_rel, maps + num_maps);
  		num_maps++;
  	}
  
! 	/*
! 	 * Do this check after the loop so hopefully we will produce a clearer
! 	 * error above
! 	 */
! 	if (old_db->rel_arr.nrels != new_db->rel_arr.nrels)
! 		pg_fatal("old and new databases \"%s\" have a different number of relations\n",
   old_db->db_name);
  
  	*nmaps = num_maps;
--- 103,114 
  		create_rel_filename_map(old_pgdata, new_pgdata, old_db, new_db,
  old_rel, new_rel, maps + num_maps);
  		num_maps++;
+ 		old_relnum++;
  	}
  
! 	/* Did we fail to exhaust the old array? */
! 	if (old_relnum != old_db->rel_arr.nrels)
! 		pg_fatal("old and new databases \"%s\" have a mismatched number of relations\n",
   old_db->db_name);
  
  	*nmaps = num_maps;



Re: [HACKERS] pg_receivexlog add synchronous mode

2014-07-03 Thread Fujii Masao
On Tue, Jul 1, 2014 at 10:11 PM, Fujii Masao  wrote:
> On Mon, Jun 30, 2014 at 7:09 PM,   wrote:
>>> Thanks for the review!
>>>
>>> +if (secs <= 0)
>>> +secs = 1;/* Always sleep at least 1 sec */
>>> +
>>> +sleeptime = secs * 1000 + usecs / 1000;
>>>
>>> The above is the code which caused that problem. 'usecs' should have been
>>> reset to zero when 'secs' are rounded up to 1 second. But not. Attached
>>> is the updated version of the patch.
>> Thank you for the refactoring v2 patch.
>> I did a review of the patch.
>>
>> 1. applied cleanly and compilation was without warnings and errors
>> 2. all regress tests was passed ok
>> 3. sleeptime is ok when the --status-intarvall is set to 1
>
> Thanks for reviewing the patch!
>
> I think that this refactoring patch is useful for improving source code
> readability and making the future patches simpler, whether we adopt
> your patch or not. So, barring any objections, I'm thinking to commit
> this refactoring patch.

Committed!

Regards,

-- 
Fujii Masao




Re: [HACKERS] Can simplify 'limit 1' with slow function?

2014-07-03 Thread gotoschool6g
slow query(8531 ms):
SELECT ST_Distance_Sphere(shape,ST_GeomFromText('POINT(116.41386186784513 
40.12211338311868)')) FROM road order by id LIMIT 1;

explain output:
"Limit  (cost=4653.48..4653.48 rows=1 width=3612)"
"  ->  Sort  (cost=4653.48..4683.06 rows=11832 width=3612)"
"Sort Key: id"
"->  Seq Scan on road  (cost=0.00..4594.32 rows=11832 width=3612)"

fast query(16ms):
select ST_Distance_Sphere(shape,ST_GeomFromText('POINT(116.41386186784513 
40.12211338311868)')) from (SELECT shape FROM road order by id  LIMIT 1) a

explain output:
"Subquery Scan on a  (cost=1695.48..1695.74 rows=1 width=3608)"
"  ->  Limit  (cost=1695.48..1695.48 rows=1 width=3612)"
"->  Sort  (cost=1695.48..1725.06 rows=11832 width=3612)"
"  Sort Key: road.id"
"  ->  Seq Scan on road  (cost=0.00..1636.32 rows=11832 width=3612)"

CREATE TABLE road
(
  shape geometry,
  id integer
)
WITH (
  OIDS=FALSE
);

Are there redundant calls when sorting?


> On Tue, Jul 1, 2014 at 2:16 PM, Martijn van Oosterhout 
>  wrote: 
> > On Sun, Jun 29, 2014 at 10:05:50PM +0800, gotoschool6g wrote: 
> >> The simplified scene: 
> >> select slowfunction(s) from a order by b limit 1; 
> >> is slow than 
> >> select slowfunction(s) from (select s from a order by b limit 1) as z; 
> >> if there are many records in table 'a'. 
> >> 
> >> 
> >> The real scene. Function  ST_Distance_Sphere is slow, the query: 
> >> SELECT ST_Distance_Sphere(s, ST_GeomFromText('POINT(1 1)')) from road 
> >> order by c limit 1; 
> >> is slow than: 
> >> select ST_Distance_Sphere(s, ST_GeomFromText('POINT(1 1)')) from (SELECT s 
> >> from road order by c limit 1) as a; 
> >> There are about 7000 records in 'road'. 
> > 
> > I think to help here I think we need the EXPLAIN ANALYSE output for 
> > both queries. 
>  
> Well, I think the problem is a well understood one: there is no 
> guarantee that functions-in-select-list are called exactly once per 
> output row.  This is documented -- for example see here: 
> http://www.postgresql.org/docs/9.1/static/explicit-locking.html#ADVISORY-LOCKS.
>  
> In short, if you want very precise control of function evaluation use 
> a subquery, or, if you're really paranoid, a CTE. 

I'm probably dense, but I'm not sure I understand. Or is it that the
slowfunction() is called prior to the sort? That seems insane.

Have a nice day, 
--  
Martijn van Oosterhout  http://svana.org/kleptog/ 
> He who writes carelessly confesses thereby at the very outset that he does 
> not attach much importance to his own thoughts. 
   -- Arthur Schopenhauer 


Re: [HACKERS] docs: additional subsection for page-level locks in explicit-locking section

2014-07-03 Thread Fujii Masao
On Fri, Jul 4, 2014 at 12:51 AM, Kevin Grittner  wrote:
> Fujii Masao  wrote:
>
>> This seems to make sense. Barring objection, I will commit this
>> only in HEAD.

Committed.

> I'm inclined to think this is a slight improvement, just for the
> sake of consistency with peer level information.  You probably
> already noticed, but the patch as submitted neglects to close the
> prior sect2 block before opening the new one.

Yes, thanks for pointing that out!

Regards,

-- 
Fujii Masao




Re: [HACKERS] "RETURNING PRIMARY KEY" syntax extension

2014-07-03 Thread Tom Dunstan
On 4 July 2014 11:37, Tom Lane  wrote:
>
> > Sure it can - it can append RETURNING PRIMARY KEY and hand back a ResultSet
> > from whatever was returned. It's CURRENTLY doing that, but it's appending
> > RETURNING * and leaving it up to the caller of getGeneratedKeys() to work
> > out which columns the caller is interested in.
>
> I might be mistaken, but as I read the spec for getGeneratedKeys, whether
> or not a column is in a primary key has got nothing to do with whether
> that column should be returned.  It looks to me like what you're supposed
> to return is columns with computed default values, eg, serial columns.
> (Whether we should define it as being *exactly* columns created by the
> SERIAL macro is an interesting question, but for sure those ought to be
> included.)  Now, the pkey might be a serial column ... but it equally
> well might not be; it might not have any default expression at all.
> And there certainly could be serial columns that weren't in the pkey.
>
> The fact that the spec is kinda fuzzy doesn't entitle us to ignore the
> plain meaning of the term "generated key".
>

Well, strictly speaking no, but in reality it doesn't look like other
databases that support this feature support anything other than returning
auto-increment fields on primary keys, often only supporting a single
column (see [1]). Oracle doesn't even support that - callers have to call
the variant that specifies a column list, since historically there was no
real support for recognising such columns, Oracle being sequence-based.

As such, there can't be any callers in the wild that will expect this to
return anything other than the primary key, because there's no support for
doing anything else. And if the caller really DOES want other columns
returned, they can always call the variant that allows them to specify
those columns, or just issue a normal query with a returning clause. This
feature really exists to support the default case where those columns
aren't specified by the caller, and given current driver support (and
sanity) the only thing any caller will expect from such a result is the
primary key.

Cheers

Tom

[1]
http://www.postgresql.org/message-id/cappfruwqy0z66trv4xmdqnyv0prjky+38x+p4vkhmrrw5rb...@mail.gmail.com


Re: [HACKERS] "RETURNING PRIMARY KEY" syntax extension

2014-07-03 Thread Tom Lane
Tom Dunstan  writes:
> On 4 July 2014 00:07, Tom Lane  wrote:
>> It sounds to me like designing this for JDBC's getGeneratedKeys method
>> is a mistake.  There isn't going to be any way that the driver can support
>> that without having looked at the table's metadata for itself, and if
>> it's going to do that then it doesn't need a crutch that only partially
>> solves the problem anyhow.

> Sure it can - it can append RETURNING PRIMARY KEY and hand back a ResultSet
> from whatever was returned. It's CURRENTLY doing that, but it's appending
> RETURNING * and leaving it up to the caller of getGeneratedKeys() to work
> out which columns the caller is interested in.

I might be mistaken, but as I read the spec for getGeneratedKeys, whether
or not a column is in a primary key has got nothing to do with whether
that column should be returned.  It looks to me like what you're supposed
to return is columns with computed default values, eg, serial columns.
(Whether we should define it as being *exactly* columns created by the
SERIAL macro is an interesting question, but for sure those ought to be
included.)  Now, the pkey might be a serial column ... but it equally
well might not be; it might not have any default expression at all.
And there certainly could be serial columns that weren't in the pkey.

The fact that the spec is kinda fuzzy doesn't entitle us to ignore the
plain meaning of the term "generated key".

regards, tom lane




Re: [HACKERS] "RETURNING PRIMARY KEY" syntax extension

2014-07-03 Thread Tom Dunstan
On 4 July 2014 00:07, Tom Lane  wrote:

> TBH, I thought that RETURNING PRIMARY KEY was already fairly iffy
> in that the application would have little idea what it was getting back.
> IF EXISTS would make it so spongy as to be useless, IMO.
>

IF EXISTS is pretty pointless - while the behaviour of getGeneratedKeys()
isn't defined for cases where there aren't any, it's only meaningful if the
caller has previously asked for the keys to be returned, and someone asking
to do that where it doesn't make sense can get an error as far as I'm
concerned. No-one does this in practice.

Looking at the feature as a more general SQL construct, ISTM that if
someone requests RETURNING PRIMARY KEY where there isn't one, an error is
appropriate. And for the IF EXISTS case, when on earth will someone request
a primary key even if they're not sure one exists?


> It sounds to me like designing this for JDBC's getGeneratedKeys method
> is a mistake.  There isn't going to be any way that the driver can support
> that without having looked at the table's metadata for itself, and if
> it's going to do that then it doesn't need a crutch that only partially
> solves the problem anyhow.
>

Sure it can - it can append RETURNING PRIMARY KEY and hand back a ResultSet
from whatever was returned. It's CURRENTLY doing that, but it's appending
RETURNING * and leaving it up to the caller of getGeneratedKeys() to work
out which columns the caller is interested in.

Turns out that's actually ok - most Java-based ORMs have more than enough
metadata about the tables they manage to know which columns are the primary
key. It's the driver that doesn't have that information, which is where
RETURNING PRIMARY KEY can help by not forcing the driver to request that
every single column is returned.
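
For example, a sketch of what the driver would send when getGeneratedKeys()
is requested (the table and columns are made up; RETURNING PRIMARY KEY is the
syntax under discussion, not something already committed):

INSERT INTO orders (customer_id, total)
VALUES (42, 99.95)
RETURNING PRIMARY KEY;
-- Only the primary-key column(s) come back, instead of the full row that
-- RETURNING * produces, so the driver needs no table metadata of its own.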

The only downside that I see is cases where someone requests the keys to be
returned but already has a RETURNING clause in their insert statement -
what if the requested columns don't include the primary key? IMO it's
probably enough to document that if the caller wants to issue a RETURNING
clause then they better make sure it contains the columns they're
interested in. This is already way outside anything that standard ORMs will
do (they don't know about RETURNING) so anyone doing this is hand-crafting
anyway.

Cheers

Tom


Re: [HACKERS] Aggregate function API versus grouping sets

2014-07-03 Thread Tom Lane
Andrew Gierth  writes:
> Here are two more, cumulative with the previous patch and each other:
> The first removes the assumption in ordered_set_startup that the
> aggcontext can be cached in fn_extra, and puts it in the transvalue
> instead.

OK, done.  (ATM it seems like we wouldn't need gcontext in the transvalue
either, but I left it there in case we want it later.)

> The second exposes the OSA* structures in a header file, so that
> user-defined ordered-set aggs can use ordered_set_transition[_multi]
> directly rather than having to cargo-cult it into their own code.

I'm not very happy about this one; we intentionally did not expose
these structs, so that we could freely make fixes like, for example,
your immediately preceding patch.

I think it'd be worth considering whether there's a way to allow
third-party ordered-set aggs to use the infrastructure in orderedsetaggs.c
while still treating these structs as opaque.  In any case we'd need a
more carefully thought-through API than just decreeing that some private
structs are now public.

regards, tom lane




Re: [HACKERS] Aggregate function API versus grouping sets

2014-07-03 Thread Andrew Gierth
> "Tom" == Tom Lane  writes:

 >> How's this?  (applies and passes regression on 9.4 and master)

 Tom> Pushed with minor cosmetic adjustments.

Thanks!

-- 
Andrew (irc:RhodiumToad)




Re: [HACKERS] Aggregate function API versus grouping sets

2014-07-03 Thread Tom Lane
Andrew Gierth  writes:
> "Tom" == Tom Lane  writes:
>  Tom> If we're going to do that, I think it needs to be in 9.4.  Are
>  Tom> you going to work up a patch?

> How's this?  (applies and passes regression on 9.4 and master)

Pushed with minor cosmetic adjustments.

regards, tom lane




Re: [HACKERS] Pg_upgrade and toast tables bug discovered

2014-07-03 Thread Andres Freund
On 2014-07-03 17:09:41 -0400, Tom Lane wrote:
> Bruce Momjian  writes:
> > I have had trouble getting a table schema that is causing problems, but
> > received a report via EDB support recently that had a simple schema
> > (anonymized):
> > ...
> > needs_toast_table() computes the length of this table as 2024 bytes in
> > 9.0, and 2064 bytes on 9.1, with the TOAST threshold being 2032 bytes. 
> 
> > My initial idea is to allow for toast tables in the new cluster that
> > aren't in the old cluster by skipping over the extra toast tables.  This
> > would only be for pre-9.1 old clusters.
> 
> TBH, it has never been more than the shakiest of assumptions that the new
> version could not create toast tables where the old one hadn't.  I think
> you should just allow for that case, independently of specific PG
> versions.  Fortunately it seems easy enough, since you certainly don't
> need to put any data into the new toast tables.

I don't think it's just that simple unfortunately. If pg_class entries
get created that didn't exist on the old server there's a chance for oid
conflicts. Consider

SELECT binary_upgrade.set_next_heap_pg_class_oid('17094'::pg_catalog.oid);
CREATE TABLE table_without_toast_in_old_server(...);
-- heap oid 17094, toast oid 16384

SELECT binary_upgrade.set_next_heap_pg_class_oid('16384'::pg_catalog.oid);
CREATE TABLE another_table(...);
ERROR: could not create file ...: File exists

I think we even had reports of such a problem.

The most robust, but not trivial, approach seems to be to prevent toast
table creation if there wasn't a set_next_toast_pg_class_oid(). Then,
after all relations are created, iterate over all pg_class entries that
possibly need toast tables and recheck if they now might need one.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] tweaking NTUP_PER_BUCKET

2014-07-03 Thread Tomas Vondra
On 3.7.2014 20:50, Tomas Vondra wrote:
> Hi Stephen,
> 
> On 3.7.2014 20:10, Stephen Frost wrote:
>> Tomas,
>>
>> * Tomas Vondra (t...@fuzzy.cz) wrote:
>>> However it's likely there are queries where this may not be the case,
>>> i.e. where rebuilding the hash table is not worth it. Let me know if you
>>> can construct such query (I wasn't).
>>
>> Thanks for working on this! I've been thinking on this for a while
>> and this seems like it may be a good approach. Have you considered a
>> bloom filter over the buckets..? Also, I'd suggest you check the
>> archives from about this time last year for test cases that I was
>> using which showed cases where hashing the larger table was a better
>> choice- those same cases may also show regression here (or at least
>> would be something good to test).
> 
> Good idea, I'll look at the test cases - thanks.

I can't find the thread / test cases in the archives. I've found this
thread in hackers:

http://www.postgresql.org/message-id/caoezvif-r-ilf966weipk5by-khzvloqpwqurpak3p5fyw-...@mail.gmail.com

Can you point me to the right one, please?

regards
Tomas




Re: [HACKERS] Pg_upgrade and toast tables bug discovered

2014-07-03 Thread Tom Lane
Bruce Momjian  writes:
> I have had trouble getting a table schema that is causing problems, but
> received a report via EDB support recently that had a simple schema
> (anonymized):
> ...
> needs_toast_table() computes the length of this table as 2024 bytes in
> 9.0, and 2064 bytes on 9.1, with the TOAST threshold being 2032 bytes. 

> My initial idea is to allow for toast tables in the new cluster that
> aren't in the old cluster by skipping over the extra toast tables.  This
> would only be for pre-9.1 old clusters.

TBH, it has never been more than the shakiest of assumptions that the new
version could not create toast tables where the old one hadn't.  I think
you should just allow for that case, independently of specific PG
versions.  Fortunately it seems easy enough, since you certainly don't
need to put any data into the new toast tables.

regards, tom lane




Re: [HACKERS] "RETURNING PRIMARY KEY" syntax extension

2014-07-03 Thread Tom Lane
Greg Stark  writes:
> On Thu, Jul 3, 2014 at 7:37 AM, Tom Lane  wrote:
>> TBH, I thought that RETURNING PRIMARY KEY was already fairly iffy
>> in that the application would have little idea what it was getting back.
>> IF EXISTS would make it so spongy as to be useless, IMO.

> Seems easy enough to look at the count of columns in the result set.

Sure, if you know how many columns are in the table to begin with; but
that goes back to having table metadata.  If you don't, you're probably
going to be issuing "RETURNING *", and AFAICS "RETURNING *, PRIMARY KEY"
would be entirely useless (even without IF EXISTS :-().

I'm still unconvinced that there's a robust use-case here.

regards, tom lane




Re: [HACKERS] Array of composite types returned from python

2014-07-03 Thread Tom Lane
I wrote:
> Ronan Dunklau  writes:
>> I don't know how to do that without implementing the cache itself.

> I don't either, but my thought was that we could hack up a simple
> one-element cache pretty trivially, eg static info and desc variables
> in PLyObject_ToComposite that are initialized the first time through.
> You could only test one composite-array type per session with that
> sort of kluge, but that would be good enough for doing some simple
> performance testing.

I did that, and found that building and returning a million-element
composite array took about 4.2 seconds without any optimization, and 3.2
seconds with the hacked-up cache (as of HEAD, asserts off).  I'd say that
means we might want to do something about it eventually, but it's hardly
the first order of business.

I've committed the patch with a bit of additional cleanup.  I credited
Ronan and Ed equally as authors, since I'd say the fix for plpy.prepare
was at least as complex as the original patch.

regards, tom lane




Re: [HACKERS] Aggregate function API versus grouping sets

2014-07-03 Thread Andrew Gierth
Here are two more, cumulative with the previous patch and each other:

The first removes the assumption in ordered_set_startup that the
aggcontext can be cached in fn_extra, and puts it in the transvalue
instead.

The second exposes the OSA* structures in a header file, so that
user-defined ordered-set aggs can use ordered_set_transition[_multi]
directly rather than having to cargo-cult it into their own code.

-- 
Andrew (irc:RhodiumToad)

diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
index d116a63..92fa9f3 100644
--- a/src/backend/utils/adt/orderedsetaggs.c
+++ b/src/backend/utils/adt/orderedsetaggs.c
@@ -47,8 +47,6 @@ typedef struct OSAPerQueryState
 	Aggref	   *aggref;
 	/* Memory context containing this struct and other per-query data: */
 	MemoryContext qcontext;
-	/* Memory context containing per-group data: */
-	MemoryContext gcontext;
 
 	/* These fields are used only when accumulating tuples: */
 
@@ -86,6 +84,8 @@ typedef struct OSAPerGroupState
 {
 	/* Link to the per-query state for this aggregate: */
 	OSAPerQueryState *qstate;
+	/* MemoryContext for per-group data */
+	MemoryContext gcontext;
 	/* Sort object we're accumulating data in: */
 	Tuplesortstate *sortstate;
 	/* Number of normal rows inserted into sortstate: */
@@ -104,6 +104,15 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples)
 	OSAPerGroupState *osastate;
 	OSAPerQueryState *qstate;
 	MemoryContext oldcontext;
+	MemoryContext gcontext;
+
+	/*
+	 * Check we're called as aggregate (and not a window function), and
+	 * get the Agg node's group-lifespan context (which might not be the
+	 * same throughout the query, but will be the same for each transval)
+	 */
+	if (AggCheckCallContext(fcinfo, &gcontext) != AGG_CONTEXT_AGGREGATE)
+		elog(ERROR, "ordered-set aggregate called in non-aggregate context");
 
 	/*
 	 * We keep a link to the per-query state in fn_extra; if it's not there,
@@ -114,16 +123,9 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples)
 	{
 		Aggref	   *aggref;
 		MemoryContext qcontext;
-		MemoryContext gcontext;
 		List	   *sortlist;
 		int			numSortCols;
 
-		/*
-		 * Check we're called as aggregate (and not a window function), and
-		 * get the Agg node's group-lifespan context
-		 */
-		if (AggCheckCallContext(fcinfo, &gcontext) != AGG_CONTEXT_AGGREGATE)
-			elog(ERROR, "ordered-set aggregate called in non-aggregate context");
 		/* Need the Aggref as well */
 		aggref = AggGetAggref(fcinfo);
 		if (!aggref)
@@ -142,7 +144,6 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples)
 		qstate = (OSAPerQueryState *) palloc0(sizeof(OSAPerQueryState));
 		qstate->aggref = aggref;
 		qstate->qcontext = qcontext;
-		qstate->gcontext = gcontext;
 
 		/* Extract the sort information */
 		sortlist = aggref->aggorder;
@@ -259,10 +260,11 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples)
 	}
 
 	/* Now build the stuff we need in group-lifespan context */
-	oldcontext = MemoryContextSwitchTo(qstate->gcontext);
+	oldcontext = MemoryContextSwitchTo(gcontext);
 
 	osastate = (OSAPerGroupState *) palloc(sizeof(OSAPerGroupState));
 	osastate->qstate = qstate;
+	osastate->gcontext = gcontext;
 
 	/* Initialize tuplesort object */
 	if (use_tuples)
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
index 92fa9f3..bbad0e5 100644
--- a/src/backend/utils/adt/orderedsetaggs.c
+++ b/src/backend/utils/adt/orderedsetaggs.c
@@ -16,6 +16,8 @@
 
 #include 
 
+#include "utils/orderedset.h"
+
 #include "catalog/pg_aggregate.h"
 #include "catalog/pg_operator.h"
 #include "catalog/pg_type.h"
@@ -30,71 +32,8 @@
 #include "utils/tuplesort.h"
 
 
-/*
- * Generic support for ordered-set aggregates
- *
- * The state for an ordered-set aggregate is divided into a per-group struct
- * (which is the internal-type transition state datum returned to nodeAgg.c)
- * and a per-query struct, which contains data and sub-objects that we can
- * create just once per query because they will not change across groups.
- * The per-query struct and subsidiary data live in the executor's per-query
- * memory context, and go away implicitly at ExecutorEnd().
- */
-
-typedef struct OSAPerQueryState
-{
-	/* Aggref for this aggregate: */
-	Aggref	   *aggref;
-	/* Memory context containing this struct and other per-query data: */
-	MemoryContext qcontext;
-
-	/* These fields are used only when accumulating tuples: */
-
-	/* Tuple descriptor for tuples inserted into sortstate: */
-	TupleDesc	tupdesc;
-	/* Tuple slot we can use for inserting/extracting tuples: */
-	TupleTableSlot *tupslot;
-	/* Per-sort-column sorting information */
-	int			numSortCols;
-	AttrNumber *sortColIdx;
-	Oid		   *sortOperators;
-	Oid		   *eqOperators;
-	Oid		   *sortCollations;
-	bool	   *sortNullsFirsts;
-	/* Equality operator call info, created only if needed: */
-	FmgrInfo   *equalfns;
-
-	/* These fields are used only when accumulating datums: */
-
-	/* Inf

[HACKERS] Pg_upgrade and toast tables bug discovered

2014-07-03 Thread Bruce Momjian
There have been periodic reports of pg_upgrade errors related to toast
tables.  The most recent one was from May of this year:


http://www.postgresql.org/message-id/flat/20140520202223.gb3...@momjian.us#20140520202223.gb3...@momjian.us

The error was:

Copying user relation files
   /var/lib/postgresql/8.4/main/base/4275487/4278965
Mismatch of relation OID in database "FNBooking": old OID 4279499, new 
OID 19792
Failure, exiting

and the fix is to add a dummy TEXT column to the table on the old
cluster to force a toast table, then drop the dummy column.
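
For reference, the workaround looks like this when run on the old cluster
before pg_upgrade (the table name is a placeholder):

-- Adding any TOAST-able column forces creation of a TOAST table,
-- which remains after the column is dropped again.
ALTER TABLE affected_table ADD COLUMN pg_upgrade_dummy TEXT;
ALTER TABLE affected_table DROP COLUMN pg_upgrade_dummy;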

I have had trouble getting a table schema that is causing problems, but
received a report via EDB support recently that had a simple schema
(anonymized):

CREATE TABLE pg_upgrade_toast_test (
x1 numeric(15,0),
x2 numeric(15,0),
x3 character varying(15),
x4 character varying(60),
x5 numeric(15,0),
x6 numeric(15,0),
x7 character varying(15),
x8 character varying(60),
x9 numeric(15,0),
x10 character varying(15),
x11 character varying(60),
x12 numeric(15,0),
x13 numeric(15,0),
x14 character varying(15),
x15 character varying(60),
x16 numeric(15,0),
x17 character varying(15),
x18 character varying(60),
x19 numeric(15,0),
x20 character varying(15),
x21 character varying(60)
);

needs_toast_table() computes the length of this table as 2024 bytes in
9.0, and 2064 bytes on 9.1, with the TOAST threshold being 2032 bytes. 
It turns out it is this commit that causes the difference:

commit 97f38001acc61449f7ac09c539ccc29e40fecd26
Author: Robert Haas 
Date:   Wed Aug 4 17:33:09 2010 +

Fix numeric_maximum_size() calculation.

The old computation can sometimes underestimate the necessary space
by 2 bytes; however we're not back-patching this, because this 
result
isn't used for anything critical.  Per discussion with Tom Lane,
make the typmod test in this function match the ones in numeric()
and apply_typmod() exactly.

It seems the impact of this patch on pg_upgrade wasn't considered, or
even realized until now.

Suggestions on a fix?  

My initial idea is to allow for toast tables in the new cluster that
aren't in the old cluster by skipping over the extra toast tables.  This
would only be for pre-9.1 old clusters.  It would not involve adding
toast tables to the old cluster as pg_upgrade never modifies the old
cluster.  We already handle cases where the old cluster had toast tables
and the new cluster wouldn't ordinarily have them.
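
To make the idea concrete, here is a rough sketch of that skip-over rule
(illustrative only; the RelEntry struct, its fields, and the pairing function
are made up for this example and are not pg_upgrade's actual data structures
or matching code):

#include <stdio.h>

typedef struct RelEntry
{
	unsigned int oid;       /* pg_class OID */
	int          is_toast;  /* is this a TOAST table? */
} RelEntry;

/*
 * Pair old- and new-cluster relations, assumed to be listed in matching
 * order.  A TOAST table that exists only in the new cluster has nothing
 * to copy, so it is skipped -- but only when the old cluster is pre-9.1.
 */
static void
pair_relations(const RelEntry *old_rels, int n_old,
			   const RelEntry *new_rels, int n_new,
			   int old_cluster_is_pre_91)
{
	int		i = 0,
			j = 0;

	while (i < n_old && j < n_new)
	{
		if (old_cluster_is_pre_91 &&
			new_rels[j].is_toast && !old_rels[i].is_toast)
		{
			j++;				/* new-only toast table: skip it */
			continue;
		}
		printf("map old OID %u -> new OID %u\n",
			   old_rels[i].oid, new_rels[j].oid);
		i++;
		j++;
	}
}

int
main(void)
{
	RelEntry	old_rels[] = {{16384, 0}, {16390, 0}};
	RelEntry	new_rels[] = {{24576, 0}, {24579, 1}, {24582, 0}};

	pair_relations(old_rels, 2, new_rels, 3, 1);
	return 0;
}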

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Aggregate function API versus grouping sets

2014-07-03 Thread Andrew Gierth
> "Tom" == Tom Lane  writes:

 Tom> If we're going to do that, I think it needs to be in 9.4.  Are
 Tom> you going to work up a patch?

How's this?  (applies and passes regression on 9.4 and master)

-- 
Andrew (irc:RhodiumToad)

diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 09ff035..0388735 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -2199,44 +2199,56 @@ AggGetAggref(FunctionCallInfo fcinfo)
 }
 
 /*
- * AggGetPerTupleEContext - fetch per-input-tuple ExprContext
+ * AggGetTempMemoryContext - fetch short-term memory context
  *
- * This is useful in agg final functions; the econtext returned is the
- * same per-tuple context that the transfn was called in (which can
- * safely get reset during the final function).
+ * This is useful in agg final functions; the context returned is one that the
+ * final function can safely reset as desired. This isn't useful for transition
+ * functions, since the context returned MAY (we don't promise) be the same as
+ * the context those are called in.
  *
  * As above, this is currently not useful for aggs called as window functions.
  */
-ExprContext *
-AggGetPerTupleEContext(FunctionCallInfo fcinfo)
+MemoryContext
+AggGetTempMemoryContext(FunctionCallInfo fcinfo)
 {
 	if (fcinfo->context && IsA(fcinfo->context, AggState))
 	{
 		AggState   *aggstate = (AggState *) fcinfo->context;
 
-		return aggstate->tmpcontext;
+		return aggstate->tmpcontext->ecxt_per_tuple_memory;
 	}
 	return NULL;
 }
 
 /*
- * AggGetPerAggEContext - fetch per-output-tuple ExprContext
+ * AggRegisterCallback - register a cleanup callback linked to aggcontext
  *
  * This is useful for aggs to register shutdown callbacks, which will ensure
- * that non-memory resources are freed.
+ * that non-memory resources are freed.  These callbacks will occur just before
+ * the associated aggcontext (as returned by AggCheckCallContext) is reset,
+ * either between groups or as a result of rescanning the query.  They will NOT
+ * be called on error paths.  The primary intent is to allow for freeing of
+ * tuplestores or tuplesorts maintained in aggcontext, or pins held by slots
+ * created by the agg functions.  (They will not be called until after the
+ * result of the finalfn is no longer needed, so it's safe for the finalfn to
+ * return data that will be freed by the callback.)
  *
  * As above, this is currently not useful for aggs called as window functions.
  */
-ExprContext *
-AggGetPerAggEContext(FunctionCallInfo fcinfo)
+void
+AggRegisterCallback(FunctionCallInfo fcinfo,
+	ExprContextCallbackFunction func,
+	Datum arg)
 {
 	if (fcinfo->context && IsA(fcinfo->context, AggState))
 	{
 		AggState   *aggstate = (AggState *) fcinfo->context;
 
-		return aggstate->ss.ps.ps_ExprContext;
+		RegisterExprContextCallback(aggstate->ss.ps.ps_ExprContext, func, arg);
+
+		return;
 	}
-	return NULL;
+	elog(ERROR,"Aggregate function cannot register a callback in this context");
 }
 
 
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
index efb0411..d116a63 100644
--- a/src/backend/utils/adt/orderedsetaggs.c
+++ b/src/backend/utils/adt/orderedsetaggs.c
@@ -49,8 +49,6 @@ typedef struct OSAPerQueryState
 	MemoryContext qcontext;
 	/* Memory context containing per-group data: */
 	MemoryContext gcontext;
-	/* Agg plan node's output econtext: */
-	ExprContext *peraggecontext;
 
 	/* These fields are used only when accumulating tuples: */
 
@@ -117,7 +115,6 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples)
 		Aggref	   *aggref;
 		MemoryContext qcontext;
 		MemoryContext gcontext;
-		ExprContext *peraggecontext;
 		List	   *sortlist;
 		int			numSortCols;
 
@@ -133,10 +130,6 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples)
 			elog(ERROR, "ordered-set aggregate called in non-aggregate context");
 		if (!AGGKIND_IS_ORDERED_SET(aggref->aggkind))
 			elog(ERROR, "ordered-set aggregate support function called for non-ordered-set aggregate");
-		/* Also get output exprcontext so we can register shutdown callback */
-		peraggecontext = AggGetPerAggEContext(fcinfo);
-		if (!peraggecontext)
-			elog(ERROR, "ordered-set aggregate called in non-aggregate context");
 
 		/*
 		 * Prepare per-query structures in the fn_mcxt, which we assume is the
@@ -150,7 +143,6 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples)
 		qstate->aggref = aggref;
 		qstate->qcontext = qcontext;
 		qstate->gcontext = gcontext;
-		qstate->peraggecontext = peraggecontext;
 
 		/* Extract the sort information */
 		sortlist = aggref->aggorder;
@@ -291,9 +283,9 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples)
 	osastate->number_of_rows = 0;
 
 	/* Now register a shutdown callback to clean things up */
-	RegisterExprContextCallback(qstate->peraggecontext,
-ordered_set_shutdown,
-PointerGetDatum(osastate));
+	AggRegisterCallback(fcinfo,
+		ordered_set_s

Re: [HACKERS] "RETURNING PRIMARY KEY" syntax extension

2014-07-03 Thread Greg Stark
On Thu, Jul 3, 2014 at 7:37 AM, Tom Lane  wrote:
> TBH, I thought that RETURNING PRIMARY KEY was already fairly iffy
> in that the application would have little idea what it was getting back.
> IF EXISTS would make it so spongy as to be useless, IMO.

Seems easy enough to look at the count of columns in the result set.
But it seems like noise words -- if you don't put IF EXISTS then
surely you'll get the same behaviour anyway, no?

Fwiw when I wrote ORM-like layers I used to describe the output of the
query, including sometimes issuing WHERE 1<>0 queries and looking at
the output column types when I needed that before executing the query.
Using table metadata would have required a much more in depth
understanding of how the database worked and also wouldn't handle
expressions, joins, set operations, etc.
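
In libpq terms that describe step is roughly the following (just a sketch;
"some_table" and its columns are placeholders, the connection settings come
from the environment, and error handling is mostly omitted):

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
	PGconn	   *conn = PQconnectdb("");
	PGresult   *res;
	int			i;

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
		return 1;
	}

	/* Prepare without executing; the server works out the result shape. */
	res = PQprepare(conn, "q",
					"SELECT id, payload FROM some_table WHERE id > $1",
					0, NULL);
	PQclear(res);

	/* Ask the server to describe the statement's output columns. */
	res = PQdescribePrepared(conn, "q");
	for (i = 0; i < PQnfields(res); i++)
		printf("column %d: name=%s, type oid=%u\n",
			   i, PQfname(res, i), PQftype(res, i));

	PQclear(res);
	PQfinish(conn);
	return 0;
}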

-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tweaking NTUP_PER_BUCKET

2014-07-03 Thread Greg Stark
On Thu, Jul 3, 2014 at 11:40 AM, Atri Sharma  wrote:
> IIRC, last time when we tried doing bloom filters, I was short of some real
> world useful hash functions that we could use for building the bloom filter.

Last time we wanted to use bloom filters in hash joins to filter
out tuples that won't match any of the future hash batches, to reduce
the number of tuples that need to be spilled to disk. However, the
problem was that it was unclear, for a given amount of memory usage, how
to pick the right bloom filter size and how to model how much it would
save versus how much it would cost in reduced hash table size.

I think it just required some good empirical tests and hash-join-heavy
workloads to come up with some reasonable guesses. We don't need a
perfect model, just some reasonable bloom filter size that we're pretty
sure will usually help more than it hurts.
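
For reference, the filter itself is tiny -- something like the following
standalone sketch (illustrative only, not PostgreSQL code; the sizing, the
hash mixing and the names are arbitrary). The idea would be that the inner
side adds each tuple's hashvalue while the batches are built, and the outer
side tests its hashvalue before spilling a tuple for a later batch:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct BloomFilter
{
	uint8_t    *bits;		/* bit array, nbits bits long */
	uint32_t	nbits;		/* number of bits in the array */
	int			nhashes;	/* number of probes per value */
} BloomFilter;

/* Derive several probe positions from a single 32-bit hash value. */
static uint32_t
probe(uint32_t hashvalue, int k)
{
	uint32_t	h = hashvalue ^ (0x9e3779b9u * (uint32_t) (k + 1));

	h ^= h >> 16;
	h *= 0x85ebca6bu;
	h ^= h >> 13;
	return h;
}

static BloomFilter *
bloom_create(uint32_t nbits, int nhashes)
{
	BloomFilter *bf = malloc(sizeof(BloomFilter));

	bf->bits = calloc(nbits / 8 + 1, 1);
	bf->nbits = nbits;
	bf->nhashes = nhashes;
	return bf;
}

static void
bloom_add(BloomFilter *bf, uint32_t hashvalue)
{
	int			k;

	for (k = 0; k < bf->nhashes; k++)
	{
		uint32_t	bit = probe(hashvalue, k) % bf->nbits;

		bf->bits[bit / 8] |= (uint8_t) (1 << (bit % 8));
	}
}

static bool
bloom_maybe_contains(const BloomFilter *bf, uint32_t hashvalue)
{
	int			k;

	for (k = 0; k < bf->nhashes; k++)
	{
		uint32_t	bit = probe(hashvalue, k) % bf->nbits;

		if ((bf->bits[bit / 8] & (1 << (bit % 8))) == 0)
			return false;		/* definitely not in the filter */
	}
	return true;				/* probably in the filter */
}

int
main(void)
{
	BloomFilter *bf = bloom_create(1 << 20, 3);	/* ~128kB of bits */

	bloom_add(bf, 0xdeadbeefu);
	printf("0xdeadbeef: %s\n", bloom_maybe_contains(bf, 0xdeadbeefu) ? "maybe" : "no");
	printf("0x12345678: %s\n", bloom_maybe_contains(bf, 0x12345678u) ? "maybe" : "no");
	return 0;
}

The hard part, as above, is still choosing nbits and nhashes against the
memory budget, not writing the filter.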


-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tweaking NTUP_PER_BUCKET

2014-07-03 Thread Tomas Vondra
Hi Stephen,

On 3.7.2014 20:10, Stephen Frost wrote:
> Tomas,
> 
> * Tomas Vondra (t...@fuzzy.cz) wrote:
>> However it's likely there are queries where this may not be the case,
>> i.e. where rebuilding the hash table is not worth it. Let me know if you
>> can construct such query (I wasn't).
> 
> Thanks for working on this! I've been thinking on this for a while
> and this seems like it may be a good approach. Have you considered a
> bloom filter over the buckets..? Also, I'd suggest you check the

I know you've experimented with it, but I haven't looked into that yet.

> archives from about this time last year for test cases that I was
> using which showed cases where hashing the larger table was a better
> choice- those same cases may also show regression here (or at least
> would be something good to test).

Good idea, I'll look at the test cases - thanks.

> Have you tried to work out what a 'worst case' regression for this 
> change would look like? Also, how does the planning around this
> change? Are we more likely now to hash the smaller table (I'd guess
> 'yes' just based on the reduction in NTUP_PER_BUCKET, but did you
> make any changes due to the rehashing cost?)?

The case I was thinking about is underestimated cardinality of the inner
table and a small outer table. That'd lead to a large hash table and
very few lookups (thus making the rehash inefficient). I.e. something
like this:

  Hash Join
 Seq Scan on small_table (rows=100) (actual rows=100)
 Hash
Seq Scan on bad_estimate (rows=100) (actual rows=10)
Filter: ((a < 100) AND (b < 100))

But I wasn't able to reproduce this reasonably, because in practice
that'd lead to a nested loop or something like that (which is a planning
issue, impossible to fix in hashjoin code).

Tomas


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tweaking NTUP_PER_BUCKET

2014-07-03 Thread Atri Sharma
On Thu, Jul 3, 2014 at 11:40 PM, Stephen Frost  wrote:

> Tomas,
>
> * Tomas Vondra (t...@fuzzy.cz) wrote:
> > However it's likely there are queries where this may not be the case,
> > i.e. where rebuilding the hash table is not worth it. Let me know if you
> > can construct such query (I wasn't).
>
> Thanks for working on this!  I've been thinking on this for a while and
> this seems like it may be a good approach.  Have you considered a bloom
> filter?
>

IIRC, the last time we tried doing bloom filters, I was short of useful
real-world hash functions that we could use for building the bloom filter.

If we are restarting experiments on this, I would be glad to assist.

Regards,

Atri


Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-07-03 Thread Tomas Vondra
On 3.7.2014 15:42, Atri Sharma wrote:
> 
> 
> 
> On Tue, Jul 1, 2014 at 4:54 AM, Tomas Vondra  > wrote:
> 
> On 30.6.2014 23:12, Tomas Vondra wrote:
> > Hi,
> >
> > attached is v5 of the patch. The main change is that scaling the
> number
> > of buckets is done only once, after the initial hash table is
> build. The
> > main advantage of this is lower price. This also allowed some
> cleanup of
> > unecessary code.
> >
> > However, this new patch causes warning like this:
> >
> > WARNING:  temporary file leak: File 231 still referenced
> >
> > I've been eyeballing the code for a while now, but I still fail to see
> > where this comes from :-( Any ideas?
> 
> Meh, the patches are wrong - I haven't realized the tight coupling
> between buckets/batches in ExecHashGetBucketAndBatch:
> 
>   *bucketno = hashvalue & (nbuckets - 1);
>   *batchno = (hashvalue >> hashtable->log2_nbuckets) & (nbatch - 1);
> 
> The previous patches worked mostly by pure luck (the nbuckets was set
> only in the first batch), but once I moved the code at the end, it
> started failing. And by "worked" I mean "didn't throw an error, but
> probably returned bogus results with (nbatch>1)".
> 
> However, ISTM this nbuckets-nbatch coupling is not really necessary,
> it's only constructed this way to assign independent batch/bucket. So I
> went and changed the code like this:
> 
>   *bucketno = hashvalue & (nbuckets - 1);
>   *batchno = (hashvalue >> (32 - hashtable->log2_nbatch));
> 
> I.e. using the top bits for batchno, low bits for bucketno (as before).
> 
> Hopefully I got it right this time. At least it seems to be working for
> cases that failed before (no file leaks, proper rowcounts so far).
> 
> 
> Hi,
> 
> I had a look at the patch and I was wondering if there is a way we can
> set a cap on the resized size, and instead increase the number of
> batches instead? Since this patch evaluates the growth of tuples vs
> increase of space, we could look at increasing the number of batches
> itself instead of number of buckets, if the resized number of buckets
> exceeds a threshold. It shouldnt be too hard, AIUI it would involve a
> call in MultiExecHash right after the new cost evaluation the patch adds.

So you propose to fill the hash table, and when hitting NTUP_PER_BUCKET
(i.e. after adding 'nbuckets * NTUP_PER_BUCKET' tuples) increase the
number of batches?

Hmmm, that'd be easy to implement. I don't think it's possible to do
that in MultiExecHash (that's too late). But it's a matter of changing
this condition in ExecHashTableInsert:

if (hashtable->spaceUsed > hashtable->spaceAllowed)
ExecHashIncreaseNumBatches(hashtable);

to something like this

if ((hashtable->spaceUsed > hashtable->spaceAllowed) ||
(hashtable->totalTuples / hashtable->nbatch
  == NTUP_PER_BUCKET * hashtable->nbuckets))
ExecHashIncreaseNumBatches(hashtable);

But I don't think increasing number of batches is a good approach, as it
means more rescans. I think using memory is more efficient (after all,
that's what work_mem is for, right?).

regards
Tomas


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tweaking NTUP_PER_BUCKET

2014-07-03 Thread Stephen Frost
Tomas,

* Tomas Vondra (t...@fuzzy.cz) wrote:
> However it's likely there are queries where this may not be the case,
> i.e. where rebuilding the hash table is not worth it. Let me know if you
> can construct such query (I wasn't).

Thanks for working on this!  I've been thinking on this for a while and
this seems like it may be a good approach.  Have you considered a bloom
filter over the buckets..?  Also, I'd suggest you check the archives
from about this time last year for test cases that I was using which
showed cases where hashing the larger table was a better choice- those
same cases may also show regression here (or at least would be something
good to test).

Have you tried to work out what a 'worst case' regression for this
change would look like?  Also, how does the planning around this change?
Are we more likely now to hash the smaller table (I'd guess 'yes' just
based on the reduction in NTUP_PER_BUCKET, but did you make any changes
due to the rehashing cost?)?

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] tweaking NTUP_PER_BUCKET

2014-07-03 Thread Tomas Vondra
On 3.7.2014 02:13, Tomas Vondra wrote:
> Hi,
> 
> while hacking on the 'dynamic nbucket' patch, scheduled for the next CF
> (https://commitfest.postgresql.org/action/patch_view?id=1494) I was
> repeatedly stumbling over NTUP_PER_BUCKET. I'd like to propose a change
> in how we handle it.
> 
> 
> TL;DR; version
> --
> 
> I propose dynamic increase of the nbuckets (up to NTUP_PER_BUCKET=1)
> once the table is built and there's free space in work_mem. The patch
> mentioned above makes implementing this possible / rather simple.

Attached is v1 of this experimental patch. It's supposed to be applied
on top of v7 of this patch

   http://www.postgresql.org/message-id/53b59498.3000...@fuzzy.cz

as it shares most of the implementation. So apply it like this:

   patch -p1 < hashjoin-nbuckets-growth-v7.patch
   patch -p1 < hashjoin-dynamic-ntup-v1.patch

It implements the ideas outlined in the previous message, most importantly:

   (a) decreases NTUP_PER_BUCKET to 4

   (b) checks free work_mem when deciding whether to add batch

   (c) after building the batches, increases the number of buckets as
   much as possible, i.e. up to the number of batch tuples, but not
   exceeding work_mem

The improvements for the queries I tried so far are quite significant
(partially due to the NTUP_PER_BUCKET decrease, partially due to the
additional bucket count increase).

The rebuild is quite fast - the patch measures and reports this as a
WARNING, and the timings I've seen are ~12ms per 7MB (i.e. ~120ms for
70MB and ~1200ms for 700MB). Of course, this only makes sense when
compared to how much time it saved, but for the queries I tested so far
this was a good investment.

However it's likely there are queries where this may not be the case,
i.e. where rebuilding the hash table is not worth it. Let me know if you
can construct such a query (I wasn't able to).

regards
Tomas
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 53642d1..a4623dc 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -58,7 +58,7 @@ static void ExecHashRemoveNextSkewBucket(HashJoinTable hashtable);
  */
 
 /* Target bucket loading (tuples per bucket) */
-#define NTUP_PER_BUCKET			10
+#define NTUP_PER_BUCKET			4
 
 /* Multiple of NTUP_PER_BUCKET triggering the increase of nbuckets.
  * 
@@ -77,6 +77,8 @@ static void ExecHashRemoveNextSkewBucket(HashJoinTable hashtable);
 #define NTUP_GROW_COEFFICIENT	1.333
 #define NTUP_GROW_THRESHOLD		(NTUP_PER_BUCKET * NTUP_GROW_COEFFICIENT)
 
+#define SPACE_USED(hashtable, nbuckets) ((hashtable)->spaceUsed + (nbuckets) * sizeof(void*))
+
 /* 
  *		ExecHash
  *
@@ -156,21 +158,33 @@ MultiExecHash(HashState *node)
 		}
 	}
 
-	/* If average number of tuples per bucket is over the defined threshold,
-	 * increase the number of buckets to get below it. */
+/*
+ * Consider resizing the hash table (number of buckets) for better
+ * lookup performance. The code in ExecHashTableInsert guarantees
+ * we have enough memory to reach NTUP_PER_BUCKET, but maybe we can
+ * do better - getting lower number of tuples per bucket (up to
+ * NTUP_PER_BUCKET=1).
+ */
 	if (enable_hashjoin_bucket) {
 
-		/* consider only tuples in the non-skew buckets */
-		double nonSkewTuples = (hashtable->totalTuples - hashtable->skewTuples);
-
-		if ((nonSkewTuples / hashtable->nbatch) > (hashtable->nbuckets * NTUP_GROW_THRESHOLD)) {
+instr_time start_time, end_time;
 
 #ifdef HJDEBUG
-			printf("Increasing nbucket to %d (average per bucket = %.1f)\n",
-   nbuckets,  (batchTuples / hashtable->nbuckets));
+		printf("Increasing nbucket to %d (average per bucket = %.1f)\n",
+nbuckets,  (batchTuples / hashtable->nbuckets));
 #endif
-			ExecHashIncreaseNumBuckets(hashtable);
-		}
+
+elog(WARNING, "hash resize (start) : nbuckets=%d", hashtable->nbuckets);
+
+INSTR_TIME_SET_CURRENT(start_time);
+
+ExecHashIncreaseNumBuckets(hashtable);
+
+INSTR_TIME_SET_CURRENT(end_time);
+INSTR_TIME_SUBTRACT(end_time, start_time);
+
+elog(WARNING, "hash resize (end) : nbuckets=%d duration=%.3f", hashtable->nbuckets, INSTR_TIME_GET_MILLISEC(end_time));
+
 	}
 
 	/* must provide our own instrumentation support */
@@ -738,35 +752,34 @@ ExecHashIncreaseNumBuckets(HashJoinTable hashtable)
 	int			oldnbuckets = hashtable->nbuckets;
 	HashJoinTuple  *oldbuckets = hashtable->buckets;
 	MemoryContext   oldcxt;
-	double		batchTuples = (hashtable->totalTuples / hashtable->nbatch);
+
+/* average number of tuples per batch */
+double  batchTuples = (hashtable->totalTuples - hashtable->skewTuples) / hashtable->nbatch;
+
+/* memory available for buckets */
+SizefreeMemory = (hashtable->spaceAllowed - hashtable->spaceUsed);
 
 	/*
-	 * Determine the proper number of buckets, i.e. stop once the average
-	 * per bucket gets below

Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-07-03 Thread Tomas Vondra
On 1.7.2014 01:24, Tomas Vondra wrote:
> On 30.6.2014 23:12, Tomas Vondra wrote:
>> Hi,
>
> Hopefully I got it right this time. At least it seems to be working for
> cases that failed before (no file leaks, proper rowcounts so far).

Attached v7, fixing nbatch/ntuples in an assert.

regards
Tomas
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 0d9663c..db3a953 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1900,18 +1900,20 @@ show_hash_info(HashState *hashstate, ExplainState *es)
 		if (es->format != EXPLAIN_FORMAT_TEXT)
 		{
 			ExplainPropertyLong("Hash Buckets", hashtable->nbuckets, es);
+			ExplainPropertyLong("Original Hash Buckets",
+hashtable->nbuckets_original, es);
 			ExplainPropertyLong("Hash Batches", hashtable->nbatch, es);
 			ExplainPropertyLong("Original Hash Batches",
 hashtable->nbatch_original, es);
 			ExplainPropertyLong("Peak Memory Usage", spacePeakKb, es);
 		}
-		else if (hashtable->nbatch_original != hashtable->nbatch)
+		else if ((hashtable->nbatch_original != hashtable->nbatch) || (hashtable->nbuckets_original != hashtable->nbuckets))
 		{
 			appendStringInfoSpaces(es->str, es->indent * 2);
 			appendStringInfo(es->str,
-			"Buckets: %d  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
-			 hashtable->nbuckets, hashtable->nbatch,
-			 hashtable->nbatch_original, spacePeakKb);
+			"Buckets: %d (originally %d)  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
+			 hashtable->nbuckets, hashtable->nbuckets_original,
+			 hashtable->nbatch, hashtable->nbatch_original, spacePeakKb);
 		}
 		else
 		{
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 589b2f1..53642d1 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -37,8 +37,10 @@
 #include "utils/lsyscache.h"
 #include "utils/syscache.h"
 
+bool enable_hashjoin_bucket = true;
 
 static void ExecHashIncreaseNumBatches(HashJoinTable hashtable);
+static void ExecHashIncreaseNumBuckets(HashJoinTable hashtable);
 static void ExecHashBuildSkewHash(HashJoinTable hashtable, Hash *node,
 	  int mcvsToUse);
 static void ExecHashSkewTableInsert(HashJoinTable hashtable,
@@ -48,6 +50,33 @@ static void ExecHashSkewTableInsert(HashJoinTable hashtable,
 static void ExecHashRemoveNextSkewBucket(HashJoinTable hashtable);
 
 
+/*
+ * Compute appropriate size for hashtable given the estimated size of the
+ * relation to be hashed (number of rows and average row width).
+ *
+ * This is exported so that the planner's costsize.c can use it.
+ */
+
+/* Target bucket loading (tuples per bucket) */
+#define NTUP_PER_BUCKET			10
+
+/* Multiple of NTUP_PER_BUCKET triggering the increase of nbuckets.
+ * 
+ * Once we reach the threshold we double the number of buckets, and we
+ * want to get 1.0 on average (to get NTUP_PER_BUCKET on average). That
+ * means these two equations should hold
+ * 
+ *   b = 2a (growth)
+ *   (a + b)/2 = 1  (oscillate around NTUP_PER_BUCKET)
+ * 
+ * which means b=1. (a = b/2). If we wanted higher threshold, we
+ * could grow the nbuckets to (4*nbuckets), thus using (b=4a) for
+ * growth, leading to (b=1.6). Or (b=8a) giving 1. etc.
+ * 
+ * Let's start with doubling the bucket count, i.e. 1.333. */
+#define NTUP_GROW_COEFFICIENT	1.333
+#define NTUP_GROW_THRESHOLD		(NTUP_PER_BUCKET * NTUP_GROW_COEFFICIENT)
+
 /* 
  *		ExecHash
  *
@@ -116,6 +145,7 @@ MultiExecHash(HashState *node)
 /* It's a skew tuple, so put it into that hash table */
 ExecHashSkewTableInsert(hashtable, slot, hashvalue,
 		bucketNumber);
+hashtable->skewTuples += 1;
 			}
 			else
 			{
@@ -126,6 +156,23 @@ MultiExecHash(HashState *node)
 		}
 	}
 
+	/* If average number of tuples per bucket is over the defined threshold,
+	 * increase the number of buckets to get below it. */
+	if (enable_hashjoin_bucket) {
+
+		/* consider only tuples in the non-skew buckets */
+		double nonSkewTuples = (hashtable->totalTuples - hashtable->skewTuples);
+
+		if ((nonSkewTuples / hashtable->nbatch) > (hashtable->nbuckets * NTUP_GROW_THRESHOLD)) {
+
+#ifdef HJDEBUG
+			printf("Increasing nbucket to %d (average per bucket = %.1f)\n",
+   nbuckets,  (batchTuples / hashtable->nbuckets));
+#endif
+			ExecHashIncreaseNumBuckets(hashtable);
+		}
+	}
+
 	/* must provide our own instrumentation support */
 	if (node->ps.instrument)
 		InstrStopNode(node->ps.instrument, hashtable->totalTuples);
@@ -239,6 +286,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	int			nbatch;
 	int			num_skew_mcvs;
 	int			log2_nbuckets;
+	int			log2_nbatch;
 	int			nkeys;
 	int			i;
 	ListCell   *ho;
@@ -263,6 +311,10 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	log2_nbuckets = my_log2(nbuckets);
 	Assert(nbuckets == (1 << log2_nbuckets));
 
+	/* nbatch must be a power 

Re: [HACKERS] Setting PG-version without recompiling

2014-07-03 Thread Andreas Joseph Krogh
On Thursday 3 July 2014 at 16:16:01, Tom Lane wrote:
> Andreas Joseph Krogh  writes:
> > Hi. I'm up for testing 9.4 but my JDBC-driver fails to connect due to PG's
> > minor-version string: "4beta1". Is it possible to set this somewhere without
> > recompiling PG?
>
> No, and even if you could, that would be the wrong approach.  The right
> approach is to fix the JDBC driver to not complain about such version
> strings.  I'm a bit surprised they haven't done so long since, considering
> how long PG beta versions have been tagged like that.  For that matter,
> they really ought not complain about strings like "9.5devel" or
> "9.5alpha2" either.
>
> regards, tom lane

I'm using the pgjdbc-ng driver; I'm sure the "official" driver handles this.
The driver will be fixed (the issue has a pull request), I just wondered if
there was a magic knob I could use until the PR is merged.

--
Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
andr...@visena.com
www.visena.com

Re: [HACKERS] docs: additional subsection for page-level locks in explicit-locking section

2014-07-03 Thread Kevin Grittner
Fujii Masao  wrote:

> This seems to make sense. Barring objection, I will commit this
> only in HEAD.

I'm inclined to think this is a slight improvement, just for the
sake of consistency with peer level information.  You probably
already noticed, but the patch as submitted neglects to close the
prior sect2 block before opening the new one.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] "RETURNING PRIMARY KEY" syntax extension

2014-07-03 Thread Tom Lane
Rushabh Lathia  writes:
> On Thu, Jul 3, 2014 at 6:56 AM, Ian Barwick  wrote:
>> A better solution would be to have an optional "IF EXISTS" clause:
>> RETURNING PRIMARY KEY [ IF EXISTS ]

> Hmm, I liked the idea of adding [ IF EXISTS ]. Looking at the grammar, so
> far we have the [ IF EXISTS ] option only for DDL commands (mainly DROP);
> not sure whether it's OK to use such syntax for DML commands.

> Others please share your thoughts/comments.

TBH, I thought that RETURNING PRIMARY KEY was already fairly iffy
in that the application would have little idea what it was getting back.
IF EXISTS would make it so spongy as to be useless, IMO.

It sounds to me like designing this for JDBC's getGeneratedKeys method
is a mistake.  There isn't going to be any way that the driver can support
that without having looked at the table's metadata for itself, and if
it's going to do that then it doesn't need a crutch that only partially
solves the problem anyhow.
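
(For the record, the catalog lookup the driver would need is small -- roughly
this, sketched with libpq; "my_table" is a placeholder and error handling is
omitted:)

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
	PGconn	   *conn = PQconnectdb("");
	const char *sql =
	"SELECT a.attname "
	"  FROM pg_index i "
	"  JOIN pg_attribute a ON a.attrelid = i.indrelid "
	"                     AND a.attnum = ANY (i.indkey) "
	" WHERE i.indisprimary AND i.indrelid = $1::regclass";
	const char *params[1] = {"my_table"};
	PGresult   *res;
	int			i;

	res = PQexecParams(conn, sql, 1, NULL, params, NULL, NULL, 0);
	for (i = 0; i < PQntuples(res); i++)
		printf("primary key column: %s\n", PQgetvalue(res, i, 0));

	PQclear(res);
	PQfinish(conn);
	return 0;
}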

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Setting PG-version without recompiling

2014-07-03 Thread Tom Lane
Andreas Joseph Krogh  writes:
> Hi.   I'm up for testing 9.4 but my JDBC-driver fails to connect due to PG's 
> minor-version string: "4beta1". Is it possible to set this somewhere without 
> recompiling PG?

No, and even if you could, that would be the wrong approach.  The right
approach is to fix the JDBC driver to not complain about such version
strings.  I'm a bit surprised they haven't done so long since, considering
how long PG beta versions have been tagged like that.  For that matter,
they really ought not complain about strings like "9.5devel" or
"9.5alpha2" either.

regards, tom lane
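
PS: lenient parsing is only a few lines.  A sketch in C (not the actual
driver code): take the leading digits as the major version, an optional
".digits" as the minor version, and ignore any trailing suffix such as
"beta1", "devel" or "alpha2".

#include <stdio.h>
#include <stdlib.h>

/* Lenient parse of strings like "9.4beta1", "9.5devel", "9.3.4". */
static int
parse_server_version(const char *s, int *major, int *minor)
{
	char	   *end;
	long		maj,
				min = 0;

	maj = strtol(s, &end, 10);
	if (end == s)
		return -1;				/* no leading digits at all */
	if (*end == '.')
		min = strtol(end + 1, &end, 10);
	/* whatever remains ("beta1", "devel", "alpha2", ".4") is ignored */
	*major = (int) maj;
	*minor = (int) min;
	return 0;
}

int
main(void)
{
	const char *tests[] = {"9.4beta1", "9.5devel", "9.5alpha2", "9.3.4"};
	int			i,
				maj,
				min;

	for (i = 0; i < 4; i++)
		if (parse_server_version(tests[i], &maj, &min) == 0)
			printf("%-10s -> major=%d minor=%d\n", tests[i], maj, min);
	return 0;
}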


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposing pg_hibernate

2014-07-03 Thread Kevin Grittner
Amit Kapila  wrote:

> Overall I agree that following Robert's idea will increase the
> time to bring the database server up and reach a state where apps can
> connect and start operations,

I agree that warming the cache before beginning to apply WAL would
be best.

> but I think at least with such an approach we can claim that after
> warming buffers with pg_hibernator apps will always have
> performance greater than or equal to the case when there is no
> pg_hibernator.

I would be careful of this claim in a NUMA environment.  The whole
reason I investigated NUMA issues and wrote the NUMA patch is that
a customer had terrible performance which turned out to be caused
by cache warming using a single connection (and thus a single
process, and thus a single NUMA memory node).  To ensure that this
doesn't trash performance on machines with many cores and memory
nodes, it would be best to try to spread the data around among
nodes.  One simple way of doing this would be to find some way to
use multiple processes in hopes that they would run on different
CPU packages.
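
A bare-bones sketch of that multi-process approach (illustrative only,
nothing pg_hibernator-specific; the default file path is a placeholder):
each forked worker reads a disjoint, interleaved set of blocks, so the pages
are faulted in from several processes instead of all through one backend.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#define BLCKSZ 8192

int
main(int argc, char **argv)
{
	const char *path = (argc > 1) ? argv[1] : "base/16384/16385";
	int			nworkers = 4;
	int			w;
	off_t		filesize;
	int			fd = open(path, O_RDONLY);

	if (fd < 0)
	{
		perror("open");
		return 1;
	}
	filesize = lseek(fd, 0, SEEK_END);
	close(fd);

	for (w = 0; w < nworkers; w++)
	{
		if (fork() == 0)
		{
			char		buf[BLCKSZ];
			int			wfd = open(path, O_RDONLY);
			off_t		nblocks = filesize / BLCKSZ;
			off_t		blk;

			/* each worker touches every nworkers-th block */
			for (blk = w; blk < nblocks; blk += nworkers)
				if (pread(wfd, buf, BLCKSZ, blk * BLCKSZ) < 0)
					perror("pread");
			close(wfd);
			_exit(0);
		}
	}
	for (w = 0; w < nworkers; w++)
		wait(NULL);
	return 0;
}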

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [v9.5] Custom Plan API

2014-07-03 Thread Shigeru Hanada
Kaigai-san,

Sorry for the delayed response.

Here are my random thoughts about the patch.  I couldn't understand
the patch fully, because some of the APIs are not used by ctidscan.  If

Custom Scan patch v2 review

* Custom plan class comparison
In backend/optimizer/util/pathnode.c, custclass is compared by bit-and
with 's'.  Do you plan to use custclass as a bit field?  If so, the values
for custom plan classes should not be characters.  Otherwise, custclass
should be compared with the == operator.

* Purpose of GetSpecialCustomVar()
The reason why FinalizeCustomPlan callback is necessary is not clear to me.
Could you show a case that the API would be useful?

* Purpose of FinalizeCustomPlan()
The reason why FinalizeCustomPlan callback is necessary is not clear
to me, because ctidscan just calls finalize_primnode() and
bms_add_members() with given information.  Could you show a case that
the API would be useful?

* Is it OK to call set_cheapest() for all relkinds?
Now set_cheapest() is called not only for relations and foreign tables
but also for custom plans, and for other relations such as subqueries,
functions, and values.  Calling call_custom_scan_provider() and
set_cheapest() in the RTE_RELATION case seems similar to the old
construct; what do you think about this?

* Is it hard to get rid of CopyCustomPlan()?
Copying a ForeignScan node doesn't need a per-FDW copy function, because
fdw_private is limited to containing only copyable objects.  Can't we use
the same approach for CustomPlan?  Letting authors call NodeSetTag or
copyObject() seems uncomfortable to me.

The same would apply to TextOutCustomPlan() and TextOutCustomPath() too.

* Is MultiExec support appropriate for the first version?
The cases that need MultiExec seem a little complex for the first version
of custom scan.  What kind of plan do you have in mind for this feature?

* Does SupportBackwardScan() have enough information?
Other scans check the target list with TargetListSupportsBackwardScan().
Isn't it necessary to check it for CustomPlan too in
ExecSupportsBackwardScan()?

* Place to call custom plan provider
Is it necessary to call the provider when relkind != RELKIND_RELATION?  If
so, isn't it also necessary to call it for append relations?

I know that we are concentrating on replacing scans in the initial version,
so it would not be a serious problem, but it would be good to consider an
extensible design.

* Custom Plan Provider is "addpath"?
Passing the addpath handler as the only attribute of CUSTOM PLAN PROVIDER
seems a little odd.
Would using a handler the way FDW does make the design too complex and/or messy?

* superclass of CustomPlanState
CustomPlanState derives from ScanState instead of deriving from PlanState
directly.  I worry about the case of a non-heap-scan custom plan, but it
might be OK to postpone that consideration for the first cut.

* Naming and granularity of objects related to custom plan
I'm not sure the current naming is appropriate, especially the difference
between "custom plan", "provider", and "handler".  In the context of the
CREATE CUSTOM PLAN statement, what does the term "custom plan" mean?  My
impression is that "custom plan" is an alternative plan type, e.g.
ctidscan or pg_strom_scan.  Then what does the term "provider" mean?  My
impression is that it is the extension, such as ctidscan or pg_strom.  The
grammar allows users to pass a function via the PROVIDER clause of CREATE
CUSTOM SCAN, so that function would be the provider of the custom plan
created by the statement.

* enable_customscan
A GUC parameter enable_customscan would be useful for users who want to
disable the custom plan feature temporarily.  In the case of pg_strom,
using the GPU only in selected sessions for analytic or batch applications
seems handy.

* Adding pg_custom_plan catalog
Using "cust" as the prefix for pg_custom_plan causes ambiguity, which
makes it difficult to choose a catalog prefix for a feature named
"Custom Foo" in the future.  How about using "cusp" (CUStom Plan)?

Or is it better to use pg_custom_plan_provider as the catalog relation
name, as the documentation says that "CREATE CUSTOM PLAN defines custom
plan provider"?  Then the prefix could be "cpp" (Custom Plan Provider).
This seems to match the wording used for pg_foreign_data_wrapper.

* CREATE CUSTOM PLAN statement
This is just a question:  We need to emit CREATE CUSTOM PLAN if we want to use
I wonder how it will be extended when joins are supported as a custom class.

* New operators about TID comparison
IMO this portion should be a separate patch, because it adds OID
definitions for existing operators such as tidgt and tidle.  Is there
any (explicit or implicit) rule about defining a macro for the OID of an
operator?

* Prototype of get_custom_plan_oid()
custname (or cppname, if the rule I proposed above is used) seems
appropriate as the parameter name of get_custom_plan_oid(), because
similar functions use catalog column names in such cases.

* Coding conventions
Some lines are indented with spaces.  Will a future pgindent run
fix this issue?

* Unnecessary struct forward declaration
Forward declarations of CustomPathMethods, Plan, and CustomPlan 

Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-07-03 Thread Atri Sharma
On Tue, Jul 1, 2014 at 4:54 AM, Tomas Vondra  wrote:

> On 30.6.2014 23:12, Tomas Vondra wrote:
> > Hi,
> >
> > attached is v5 of the patch. The main change is that scaling the number
> > of buckets is done only once, after the initial hash table is build. The
> > main advantage of this is lower price. This also allowed some cleanup of
> > unecessary code.
> >
> > However, this new patch causes warning like this:
> >
> > WARNING:  temporary file leak: File 231 still referenced
> >
> > I've been eyeballing the code for a while now, but I still fail to see
> > where this comes from :-( Any ideas?
>
> Meh, the patches are wrong - I haven't realized the tight coupling
> between buckets/batches in ExecHashGetBucketAndBatch:
>
>   *bucketno = hashvalue & (nbuckets - 1);
>   *batchno = (hashvalue >> hashtable->log2_nbuckets) & (nbatch - 1);
>
> The previous patches worked mostly by pure luck (the nbuckets was set
> only in the first batch), but once I moved the code at the end, it
> started failing. And by "worked" I mean "didn't throw an error, but
> probably returned bogus results with (nbatch>1)".
>
> However, ISTM this nbuckets-nbatch coupling is not really necessary,
> it's only constructed this way to assign independent batch/bucket. So I
> went and changed the code like this:
>
>   *bucketno = hashvalue & (nbuckets - 1);
>   *batchno = (hashvalue >> (32 - hashtable->log2_nbatch));
>
> I.e. using the top bits for batchno, low bits for bucketno (as before).
>
> Hopefully I got it right this time. At least it seems to be working for
> cases that failed before (no file leaks, proper rowcounts so far).
>
>
Hi,

I had a look at the patch, and I was wondering if there is a way we can set
a cap on the resized size and instead increase the number of batches. Since
this patch evaluates the growth of tuples vs. the increase of space, we could
look at increasing the number of batches itself instead of the number of
buckets, if the resized number of buckets exceeds a threshold. It shouldn't
be too hard; AIUI it would involve a call in MultiExecHash right after the
new cost evaluation the patch adds.

It isn't a target of the current patch, but I think it is a logical
extension to it.

Thoughts/Comments?

Regards,

Atri
-- 
Regards,

Atri
*l'apprenant*


Re: [HACKERS] docs: additional subsection for page-level locks in explicit-locking section

2014-07-03 Thread Fujii Masao
On Thu, Jul 3, 2014 at 4:34 AM, Michael Banck  wrote:
> Hi,
>
> While reading through the Explicit Locking section of the manual today,
> I felt the last paragraph of section 13.3.2. (Row-level Locks) might
> merit its own subsection.  It talks about page-level locks as distinct
> from table- and row-level locks.  Then again, it is just one paragraph,
> so maybe this was deliberate and/or rejected before (though I couldn't
> find prior discussion off-hand).  Proposed patch attached.

This seems to make sense. Barring objection, I will commit this only in HEAD.

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Should extension--unpackaged-$ver.sql's \echo use FROM unpackaged;?

2014-07-03 Thread Fujii Masao
On Thu, Jul 3, 2014 at 4:26 AM, Andres Freund  wrote:
> Hi,
>
> Fujii noticed that I'd used
> \echo Use "CREATE EXTENSION pg_buffercache" to load this file. \quit
> in an extension upgrade script. That's obviously wrong, because it
> should say ALTER EXTENSION. The reason it's that way is that I copied
> the line from the unpackaged--1.0.sql file.
> I noticed all unpackaged--$version transition files miss the "FROM
> unpackaged" in that echo.
> I guess that's just a oversight which we should correct?

Maybe, for user-friendliness. But I don't think that fix is important enough
to backpatch.

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] gaussian distribution pgbench

2014-07-03 Thread Andres Freund
On 2014-07-03 21:27:53 +0900, Fujii Masao wrote:
> > Add drawing random integers with a Gaussian or truncated exponential
> > distributions to pgbench.
> >
> > Test variants with these distributions are also provided and triggered
> > with options "--gaussian=..." and "--exponential=...".
> 
> IIRC we've not reached consensus about whether we should support
> such options in pgbench. Several hackers disagreed to support them.

Yea. I certainly disagree with the patch in its current state because
it copies the same 15 lines several times with a two-word
difference. Independent of whether we want those options, I don't think
that's going to fly.

> OTOH, we've almost reached the consensus that supporting gaussian
> and exponential options in \setrandom. So I think that you should
> separate those two features into two patches, and we should apply
> the \setrandom one first. Then we can discuss whether the other patch
> should be applied or not.

Sounds like a good plan.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] gaussian distribution pgbench

2014-07-03 Thread Fujii Masao
On Wed, Jul 2, 2014 at 6:05 PM, Fabien COELHO  wrote:
>
> Hello Mitsumasa-san,
>
>> And I'm also interested in your "decile percents" output like under
>> followings,
>> decile percents: 39.6% 24.0% 14.6% 8.8% 5.4% 3.3% 2.0% 1.2% 0.7% 0.4%
>
>
> Sure, I'm really fine with that.
>
>
>> I think that it is easier than before. Sum of decile percents is just
>> 100%.
>
>
> That's a good property:-)
>
>> However, I don't like "highest/lowest percentage" because users will
>> confuse it with the decile percentages, and nobody can understand these
>> digits. I could not understand "4.9%, 0.0%" when I saw it the first time.
>> Then I checked the source code and understood it:( It's not a good
>> design... #Why does this parameter use 100?
>
>
> What else? People have ten fingers and like powers of 10, and are used to
> percents?
>
>
>> So I'd like to remove it if you like. It will be more simple.
>
>
> I think that for the exponential distribution it helps, especially for high
> threshold, to have the lowest/highest percent density. For low thresholds,
> the decile is also definitely useful. So I'm fine with both outputs as you
> have put them.
>
> I have just updated the wording so that it may be clearer:
>
>  decile percents: 69.9% 21.0% 6.3% 1.9% 0.6% 0.2% 0.1% 0.0% 0.0% 0.0%
>  probability of first/last percent of the range: 11.3% 0.0%
>
>
>> Attached patch is fixed version, please confirm it.
>
>
> Attached a v15 which just fixes a typo and the above wording update. I'm
> validating it for committers.
>
>
>> #Of course, the World Cup is being held now. I'm not in a hurry at all.
>
>
> I'm not a soccer kind of person, so it does not influence my
> availability. :-)
>
>
> Suggested commit message:
>
> Add drawing random integers with a Gaussian or truncated exponential
> distributions to pgbench.
>
> Test variants with these distributions are also provided and triggered
> with options "--gaussian=..." and "--exponential=...".

IIRC we've not reached consensus about whether we should support
such options in pgbench. Several hackers disagreed with supporting them.
OTOH, we've almost reached a consensus on supporting the gaussian
and exponential options in \setrandom. So I think that you should
separate those two features into two patches, and we should apply
the \setrandom one first. Then we can discuss whether the other patch
should be applied or not.

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] introduce XLogLockBlockRangeForCleanup()

2014-07-03 Thread Simon Riggs
On 3 July 2014 06:45, Amit Khandekar  wrote:

> In GetBufferWithoutRelcache(), I was wondering, rather than calling
> PinBuffer(), if we do this :
> LockBufHdr(buf);
> PinBuffer_Locked(buf);
> valid = ((buf->flags & BM_VALID) != 0);
> then we can avoid having the new buffer access strategy BAS_DISCARD that is
> introduced in this patch. And so the code changes in freelist.c would not be
> necessary.

That looks like a good idea, thanks.

I think we should say this though

LockBufHdr(buf);
valid = ((buf->flags & BM_VALID) != 0);
if (valid)
PinBuffer_Locked(buf);
else
UnlockBufHdr(buf);

since otherwise we would access the buffer flags without the spinlock
and we would leak a pin if the buffer was not valid

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] introduce XLogLockBlockRangeForCleanup()

2014-07-03 Thread Abhijit Menon-Sen
FYI, I've attached a patch that does what you suggested. I haven't done
anything else (i.e. testing) with it yet.

-- Abhijit
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 5f9fc49..dc90c02 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -501,33 +501,9 @@ btree_xlog_vacuum(XLogRecPtr lsn, XLogRecord *record)
 	 * here during crash recovery.
 	 */
 	if (HotStandbyActiveInReplay())
-	{
-		BlockNumber blkno;
-
-		for (blkno = xlrec->lastBlockVacuumed + 1; blkno < xlrec->block; blkno++)
-		{
-			/*
-			 * We use RBM_NORMAL_NO_LOG mode because it's not an error
-			 * condition to see all-zero pages.  The original btvacuumpage
-			 * scan would have skipped over all-zero pages, noting them in FSM
-			 * but not bothering to initialize them just yet; so we mustn't
-			 * throw an error here.  (We could skip acquiring the cleanup lock
-			 * if PageIsNew, but it's probably not worth the cycles to test.)
-			 *
-			 * XXX we don't actually need to read the block, we just need to
-			 * confirm it is unpinned. If we had a special call into the
-			 * buffer manager we could optimise this so that if the block is
-			 * not in shared_buffers we confirm it as unpinned.
-			 */
-			buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, blkno,
-			RBM_NORMAL_NO_LOG);
-			if (BufferIsValid(buffer))
-			{
-LockBufferForCleanup(buffer);
-UnlockReleaseBuffer(buffer);
-			}
-		}
-	}
+		XLogLockBlockRangeForCleanup(xlrec->node, MAIN_FORKNUM,
+	 xlrec->lastBlockVacuumed + 1,
+	 xlrec->block);
 
 	/*
 	 * If we have a full-page image, restore it (using a cleanup lock) and
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index b7829ff..d84f83a 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -287,10 +287,6 @@ XLogReadBuffer(RelFileNode rnode, BlockNumber blkno, bool init)
  *
  * In RBM_ZERO and RBM_ZERO_ON_ERROR modes, if the page doesn't exist, the
  * relation is extended with all-zeroes pages up to the given block number.
- *
- * In RBM_NORMAL_NO_LOG mode, we return InvalidBuffer if the page doesn't
- * exist, and we don't check for all-zeroes.  Thus, no log entry is made
- * to imply that the page should be dropped or truncated later.
  */
 Buffer
 XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
@@ -331,8 +327,6 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 			log_invalid_page(rnode, forknum, blkno, false);
 			return InvalidBuffer;
 		}
-		if (mode == RBM_NORMAL_NO_LOG)
-			return InvalidBuffer;
 		/* OK to extend the file */
 		/* we do this in recovery only - no rel-extension lock needed */
 		Assert(InRecovery);
@@ -375,6 +369,62 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 	return buffer;
 }
 
+/*
+ * XLogBlockRangeForCleanup is used in Hot Standby mode to emulate the
+ * locking effects imposed by VACUUM for nbtrees.
+ */
+void
+XLogLockBlockRangeForCleanup(RelFileNode rnode, ForkNumber forkNum,
+			 BlockNumber startBlkno,
+			 BlockNumber uptoBlkno)
+{
+	BlockNumber blkno;
+	BlockNumber lastblock;
+	BlockNumber	endingBlkno;
+	Buffer		buffer;
+	SMgrRelation smgr;
+
+	Assert(startBlkno != P_NEW);
+	Assert(uptoBlkno != P_NEW);
+
+	/* Open the relation at smgr level */
+	smgr = smgropen(rnode, InvalidBackendId);
+
+	/*
+	 * Create the target file if it doesn't already exist.  This lets us cope
+	 * if the replay sequence contains writes to a relation that is later
+	 * deleted.  (The original coding of this routine would instead suppress
+	 * the writes, but that seems like it risks losing valuable data if the
+	 * filesystem loses an inode during a crash.  Better to write the data
+	 * until we are actually told to delete the file.)
+	 */
+	smgrcreate(smgr, forkNum, true);
+
+	lastblock = smgrnblocks(smgr, forkNum);
+
+	endingBlkno = uptoBlkno;
+	if (lastblock < endingBlkno)
+		endingBlkno = lastblock;
+
+	for (blkno = startBlkno; blkno < endingBlkno; blkno++)
+	{
+		/*
+		 * All we need to do here is prove that we can lock each block
+		 * with a cleanup lock. It's not an error to see all-zero pages
+		 * here because the original btvacuumpage would not have thrown
+		 * an error either.
+		 *
+		 * We don't actually need to read the block, we just need to
+		 * confirm it is unpinned.
+		 */
+		buffer = GetBufferWithoutRelcache(rnode, forkNum, blkno);
+		if (BufferIsValid(buffer))
+		{
+			LockBufferForCleanup(buffer);
+			UnlockReleaseBuffer(buffer);
+		}
+	}
+}
 
 /*
  * Struct actually returned by XLogFakeRelcacheEntry, though the declared
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 07ea665..d3c7eb7 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -222,8 +222,6 @@ ReadBuffer(Relation reln, BlockNumber blockNum)
  * current physical EOF; that

Re: [HACKERS] WAL replay bugs

2014-07-03 Thread Kyotaro HORIGUCHI
Hello. Here are some additional comments on the other parts.

I haven't seen any significant defects in the code so far.


= bufcapt.c: 

- buffer_capture_remember() or so.

 Writing of pages belonging to unlogged tables is avoided, taking
 advantage of the fact that the LSN for such pages stays 0/0. I'd like
 to see a comment mentioning this in, say, buffer_capture_is_changed or
 buffer_capture_forget, or somewhere similar.

- buffer_capture_forget()

 Although this error is unlikely to occur, in the error message
 "could not find image...", the buffer id brings hardly any
 information. The LSN, or the quadruplet of LSN, rnode, forknum and
 blockno, would be more informative.

- buffer_capture_is_changed()

 The name of the function seems misleading. What this
 function does is compare the LSNs of the stored page image and the
 current page.  lsn_is_changed(BufferImage) or something like that
 would be clearer.

== bufmgr.c:

- ConditionalLockBuffer()

 Sorry for a trivial comment, but the variable name 'res' conceals
 its meaning. Wouldn't "acquired" be preferable?

- LockBuffer()

 There is an 'XXX' comment.  The discussion written there seems
 right to me. If you think that peeking in there is bad behavior,
 adding a macro to lwlock.h would help:p

 lwlock.h: #define LWLockHoldedExclusive(l) ((l)->exclusive > 0)
 lwlock.h: #define LWLockHoldedShared(l) ((l)->shared > 0)

# Though I don't see much other use for these..
 
 bufmgr.c: if (LWLockHoldedExclusive(buf->content_lock))

 If there isn't any particular concern, 'XXX:' should be removed.

= bufpage.c:

-  #include "storage/bufcapt.h"

  The include seems to be needless.

= bufcapt.h:

- header comment

  The file name is misspelled as 'bufcaptr.h'.
   Is a copyright notice for UC needed? (The same applies to the other files.)

- The includes in this header except for buf.h seem not to be
  necessary.

= init_env.sh:

- pg_get_test_port()

  It determines the server port using PG_VERSION_NUM, but is that
   necessary? In addition, the theoretical maximum initial
   PGPORT seems to be 65535, but this script searches for a free port
   up to 16 numbers above the starting one, which could go past
   that limit.

- pg_get_test_port()

  It stops with an error after 16 failures, but the message
   complains only about the final attempt. If you want to mention
   the port numbers, it should probably say 'port $PGPORTSTART-$PGPORT
   ..' or 'all 16 attempted ports failed' or something similar.



regards,


-- 
Kyotaro Horiguchi
NTT Open Source Software Center


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Perfomance degradation 9.3 (vs 9.2) for FreeBSD

2014-07-03 Thread Tatsuo Ishii
Hi,

> Hi,
> 
> Attached you can find a short (compile tested only ) patch implementing
> a 'shared_memory_type' GUC, akin to 'dynamic_shared_memory_type'. Will
> only apply to 9.4, not 9.3, but it should be easy to convert for it.
> 
> Greetings,
> 
> Andres Freund

I have rebased Andres's patch against the 9.3-STABLE tree. Please take a
look at the attached patches and let me know if you find anything strange.

I am going to test the patch on a huge HP machine: DL980G7/64
cores/2TB mem.  With the patch I should be able to report back on whether
using SysV shared mem fixes the 9.3 performance problem.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp
diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index f746c81..e82054a 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -72,6 +72,7 @@ static void IpcMemoryDetach(int status, Datum shmaddr);
 static void IpcMemoryDelete(int status, Datum shmId);
 static PGShmemHeader *PGSharedMemoryAttach(IpcMemoryKey key,
 	 IpcMemoryId *shmid);
+static void *CreateAnonymousSegment(Size *size);
 
 
 /*
@@ -389,49 +390,19 @@ PGSharedMemoryCreate(Size size, bool makePrivate, int port)
 	 * developer use, this shouldn't be a big problem.
 	 */
 #ifndef EXEC_BACKEND
+	if (shared_memory_type == SHMEM_TYPE_MMAP)
 	{
-		long		pagesize = sysconf(_SC_PAGE_SIZE);
-
-		/*
-		 * Ensure request size is a multiple of pagesize.
-		 *
-		 * pagesize will, for practical purposes, always be a power of two.
-		 * But just in case it isn't, we do it this way instead of using
-		 * TYPEALIGN().
-		 */
-		if (pagesize > 0 && size % pagesize != 0)
-			size += pagesize - (size % pagesize);
-
-		/*
-		 * We assume that no one will attempt to run PostgreSQL 9.3 or later
-		 * on systems that are ancient enough that anonymous shared memory is
-		 * not supported, such as pre-2.4 versions of Linux.  If that turns
-		 * out to be false, we might need to add a run-time test here and do
-		 * this only if the running kernel supports it.
-		 */
-		AnonymousShmem = mmap(NULL, size, PROT_READ | PROT_WRITE, PG_MMAP_FLAGS,
-			  -1, 0);
-		if (AnonymousShmem == MAP_FAILED)
-		{
-			int		saved_errno = errno;
-
-			ereport(FATAL,
-	(errmsg("could not map anonymous shared memory: %m"),
-	 (saved_errno == ENOMEM) ?
-errhint("This error usually means that PostgreSQL's request "
-	 "for a shared memory segment exceeded available memory "
-	  "or swap space. To reduce the request size (currently "
-	  "%lu bytes), reduce PostgreSQL's shared memory usage, "
-		"perhaps by reducing shared_buffers or "
-		"max_connections.",
-		(unsigned long) size) : 0));
-		}
+		AnonymousShmem = CreateAnonymousSegment(&size);
 		AnonymousShmemSize = size;
-
 		/* Now we need only allocate a minimal-sized SysV shmem block. */
 		sysvsize = sizeof(PGShmemHeader);
 	}
+	else
 #endif
+	{
+		Assert(shared_memory_type == SHMEM_TYPE_SYSV);
+		sysvsize = size;
+	}
 
 	/* Make sure PGSharedMemoryAttach doesn't fail without need */
 	UsedShmemSegAddr = NULL;
@@ -631,3 +602,47 @@ PGSharedMemoryAttach(IpcMemoryKey key, IpcMemoryId *shmid)
 
 	return hdr;
 }
+
+/*
+ * Creates an anonymous mmap()ed shared memory segment.
+ *
+ * Pass the requested size in *size.  This function will modify *size to the
+ * actual size of the allocation, if it ends up allocating a segment that is
+ * larger than requested.
+ */
+#ifndef EXEC_BACKEND
+static void *
+CreateAnonymousSegment(Size *size)
+{
+	Size		allocsize = *size;
+	void	   *ptr = MAP_FAILED;
+	int			mmap_errno = 0;
+
+	/*
+	 * use the original size, not the rounded up value, when falling back
+	 * to non-huge pages.
+	 */
+	allocsize = *size;
+	ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE,
+			   PG_MMAP_FLAGS, -1, 0);
+	mmap_errno = errno;
+
+	if (ptr == MAP_FAILED)
+	{
+		errno = mmap_errno;
+		ereport(FATAL,
+(errmsg("could not map anonymous shared memory: %m"),
+ (mmap_errno == ENOMEM) ?
+ errhint("This error usually means that PostgreSQL's request "
+	"for a shared memory segment exceeded available memory, "
+	  "swap space or huge pages. To reduce the request size "
+		 "(currently %zu bytes), reduce PostgreSQL's shared "
+	   "memory usage, perhaps by reducing shared_buffers or "
+		 "max_connections.",
+		 *size) : 0));
+	}
+
+	*size = allocsize;
+	return ptr;
+}
+#endif
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 918ac51..51dccdc 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -39,6 +39,8 @@
 #include "storage/sinvaladt.h"
 #include "storage/spin.h"
 
+/* GUCs */
+int			shared_memory_type = DEFAULT_SHARED_MEMORY_TYPE;
 
 shmem_startup_hook_type shmem_startup_hook = NULL;
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 2b6527f..2945a68 100644
--- a/src/backend/utils/misc/guc.c
+++ b
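
To make the new knob concrete, here is a minimal postgresql.conf sketch. It
assumes the guc.c hunk (truncated above) registers the setting as
shared_memory_type with string values matching the SHMEM_TYPE_MMAP and
SHMEM_TYPE_SYSV constants; the exact value names are an assumption here:

shared_memory_type = sysv    # assumed value name: fall back to System V shared memory
#shared_memory_type = mmap   # assumed default: anonymous mmap(), the stock 9.3 behaviour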

Re: [HACKERS] gaussian distribution pgbench

2014-07-03 Thread Gavin Flower

On 03/07/14 20:58, Fabien COELHO wrote:


Hello Gavin,


 decile percents: 69.9% 21.0% 6.3% 1.9% 0.6% 0.2% 0.1% 0.0% 0.0% 0.0%
 probability of first/last percent of the range: 11.3% 0.0%


I would suggest that probabilities should NEVER be expressed in 
percentages! As a percentage, probability looks weird, and is never
used for serious statistical work - in my experience at least.


I think probabilities should be expressed in the range 0 ... 1 - i.e. 
0.35 rather than 35%.


I could agree about the mathematics, but ISTM that "11.5%" is more 
readable and intuitive than "0.115".


I could change "probability" and replace it with "frequency" or maybe 
"occurrence"; what would you think about that?




You may well be hitting a situation where you meet opposition whatever 
you do!  :-)


"frequency" implies a positive integer (though "relative frequency" 
might be okay) - and if you use "occurrence", someone else is bound to 
complain...


Though I'd opt for "relative frequency" if you can't use values in the
range 0 ... 1 for probabilities and %'s are used - so long as it does not
generate a flame war.


I suspect it may not be worth the grief to change.


Cheers,
Gavin




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Setting PG-version without recompiling

2014-07-03 Thread Andreas Joseph Krogh
On Thursday 3 July 2014 at 11:13:44, Abhijit Menon-Sen
<a...@2ndquadrant.com> wrote:
> At 2014-07-03 11:02:34 +0200, andr...@visena.com wrote:
> >
> > Is it possible to set this somewhere without
> > recompiling PG?
>
> I'm afraid not.

Ok

--
Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
andr...@visena.com
www.visena.com

Re: [HACKERS] Setting PG-version without recompiling

2014-07-03 Thread Abhijit Menon-Sen
At 2014-07-03 11:02:34 +0200, andr...@visena.com wrote:
>
> Is it possible to set this somewhere without 
> recompiling PG?

I'm afraid not.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] "RETURNING PRIMARY KEY" syntax extension

2014-07-03 Thread Rushabh Lathia
On Thu, Jul 3, 2014 at 6:56 AM, Ian Barwick  wrote:

> On 01/07/14 21:00, Rushabh Lathia wrote:
>
>>
>> I spent some more time on the patch and here are my review comments.
>>
>> .) Patch gets applied through patch -p1 (git apply fails)
>>
>> .) trailing whitespace in the patch at various places
>>
>
> Not sure where you see this, unless it's in the tests, where it's
> required.
>
>
Please ignore the whitespace comment. I don't know how, but somehow that
was due to my local changes in the branch.


>
>  .) Unnecessary new line + and - in the patch.
>> (src/backend/rewrite/rewriteManip.c::getInsertSelectQuery())
>> (src/include/rewrite/rewriteManip.h)
>>
>
> Fixed.
>
>
>  .) patch has proper test coverage and regression running cleanly.
>>
>> .) In map_primary_key_to_list() the patch uses the
>> INDEX_ATTR_BITMAP_IDENTITY_KEY bitmap to get the keycols. In
>> IndexAttrBitmapKind there is also INDEX_ATTR_BITMAP_KEY, so I was confused
>> between INDEX_ATTR_BITMAP_KEY and INDEX_ATTR_BITMAP_IDENTITY_KEY, and I
>> also haven't found related comments in the code. Later, with a testcase
>> and some debugging, I confirmed that INDEX_ATTR_BITMAP_IDENTITY_KEY only
>> returns the primary key.
>>
>
> Revised patch version (see other mail) fixes this by introducing
> INDEX_ATTR_BITMAP_PRIMARY_KEY.
>
>
Looks good.


>
>  .) At present in the patch, when RETURNING PRIMARY KEY is used on a table
>> having no primary key it throws an error. If I am not wrong, JDBC will be
>> using that in getGeneratedKeys(). Currently this feature is implemented in
>> the JDBC driver by appending "RETURNING *" to the supplied statement. With
>> this implementation it will replace "RETURNING *" with "RETURNING PRIMARY
>> KEY", right? So I am just wondering what JDBC expects getGeneratedKeys()
>> to return when the query doesn't have a primary key and the user called
>> executeUpdate() with Statement.RETURN_GENERATED_KEYS? I looked at the JDBC
>> specification but it's not clear what it will return when the table
>> doesn't have keys. Can you please let us know your findings on this?
>>
>
> The spec [*] is indeed frustratingly vague:
>
> The method Statement.getGeneratedKeys, which can be called to retrieve
> the generated value, returns a ResultSet object with a column for each
> automatically generated value. The methods execute, executeUpdate or
> Connection.prepareStatement accept an optional parameter, which can be
> used to indicate that any auto generated values should be returned when
> the statement is executed or prepared.
>
> [*] http://download.oracle.com/otn-pub/jcp/jdbc-4_1-mrel-spec/jdbc4.1-fr-spec.pdf
>
> I understand this to mean that no rows will be returned if no
> auto-generated values are present.
>
> As-is, the patch will raise an error if the target table does not have a
> primary key,
> which makes sense from the point of view of the proposed syntax, but which
> will
> make it impossible for the JDBC driver to implement the above
> understanding of the spec
> (i.e. return nothing if no primary key exists).
>
The basic reason for this patch is to allow JDBC to use this syntax for
getGeneratedKeys(). Now, because this patch throws an error when the table
doesn't have a primary key, it won't be useful for the getGeneratedKeys()
implementation.


> It would be simple enough not to raise an error in this case, but that
> means the
> query would be effectively failing silently and I don't think that's
> desirable
> behaviour.
>
>
Yes, not raising an error wouldn't be desirable behavior.


> A better solution would be to have an optional "IF EXISTS" clause:
>
>   RETURNING PRIMARY KEY [ IF EXISTS ]
>
> which would be easy enough to implement.
>
>
> Thoughts?
>
>
Hmm, I like the idea of adding [ IF EXISTS ]. Looking at the grammar, so far
we have the [ IF EXISTS ] option only for DDL commands (mainly DROP), so I am
not sure whether it's ok to use such syntax for DML commands.
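
To make the behaviour being discussed concrete, here is a rough sketch (the
table and column names are made up, and the IF EXISTS variant is only the
proposal above, not something that exists yet):

-- table with a primary key: the generated key is returned
CREATE TABLE with_pk (id serial PRIMARY KEY, val text);
INSERT INTO with_pk (val) VALUES ('x') RETURNING PRIMARY KEY;

-- table without a primary key
CREATE TABLE no_pk (val text);

-- current patch: this raises an error
INSERT INTO no_pk (val) VALUES ('x') RETURNING PRIMARY KEY;

-- proposed IF EXISTS form: this would return nothing instead of erroring
INSERT INTO no_pk (val) VALUES ('x') RETURNING PRIMARY KEY IF EXISTS;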

Others please share your thoughts/comments.


>
> Ian Barwick
>
> --
>  Ian Barwick   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services
>



-- 
Rushabh Lathia


[HACKERS] Setting PG-version without recompiling

2014-07-03 Thread Andreas Joseph Krogh
Hi.

I'm up for testing 9.4 but my JDBC driver fails to connect due to PG's
minor-version string: "4beta1". Is it possible to set this somewhere without
recompiling PG?

Thanks.

--
Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
andr...@visena.com
www.visena.com

Re: [HACKERS] gaussian distribution pgbench

2014-07-03 Thread Fabien COELHO


Hello Gavin,


 decile percents: 69.9% 21.0% 6.3% 1.9% 0.6% 0.2% 0.1% 0.0% 0.0% 0.0%
 probability of first/last percent of the range: 11.3% 0.0%


I would suggest that probabilities should NEVER be expressed in percentages! 
As a percentage, probability looks weird, and is never used for serious 
statistical work - in my experience at least.


I think probabilities should be expressed in the range 0 ... 1 - i.e. 0.35 
rather than 35%.


I could agree about the mathematics, but ISTM that "11.5%" is more 
readable and intuitive than "0.115".


I could change "probability" and replace it with "frequency" or maybe 
"occurrence"; what would you think about that?


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Autonomous Transaction (WIP)

2014-07-03 Thread Amit Kapila
On Thu, Jul 3, 2014 at 12:03 PM, Rajeev rastogi 
wrote:
> On 01 July 2014 12:00, Amit Kapila Wrote:
> >Simon has mentioned that exactly this idea has been rejected at
>
> >PGCon 2 years back. Please refer that in below mail:
>
> >
http://www.postgresql.org/message-id/bf2827dcce55594c8d7a8f7ffd3ab7713dde1...@szxeml508-mbx.china.huawei.com
>
>
>
> >As far as I can see, you never came back with the different solution.
>
>
>
> Yeah, right. So for this I tried to search the archived mails to get the
> details about the discussion, but I could not find anything regarding the
> design. So I am not sure how I should make my solution different from the
> earlier one, as the earlier solution is not accessible to me.

I haven't read your idea/patch in any detail, so I can't comment
on whether it is good or bad.  However, I think that if one of the
Committers has already mentioned that the same idea has been
rejected previously, then it makes little sense to further review
or update the patch unless you know the reason for the rejection and
handle it in an acceptable way.

Now, as far as I can understand, the problem seems to be in the way
you have defined Autonomous Transaction Storage, which can lead
to consumption of additional client slots; this is just what I could make
out from the above mail, but I am not completely sure on this matter.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com