Re: [HACKERS] add modulo (%) operator to pgbench

2014-09-10 Thread Fabien COELHO


Hello Robert,


I am not objecting to the functionality; I'm objecting to bolting on
ad-hoc operators one at a time.  I think an expression syntax would
let us do this in a much more scalable way.  If I had time, I'd go do
that, but I don't.  We could add abs(x) and hash(x) and it would all
be grand.


Ok. I do agree that an expression syntax would be great!

However, that would not diminish or much change the amount and kind of 
code necessary to add an operator or a function: the parser would have to 
be updated, and the expression structure, and the executor. Currently the 
pgbench "parser" and expression structure are very limited, but they are 
also generic enough that nothing is needed there to add a new operator; 
only the execution code needs to be updated, and that update would be 
basically the same ("if this is this function or operator, actually do 
it...") as the one needed for the execution part of a real expression syntax.
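
For instance, a minimal sketch of the kind of executor change I mean, with
hypothetical names (this is not the actual pgbench code):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical evaluator fragment: each new operator is just one more case. */
static int64_t
eval_binary_op(char op, int64_t lval, int64_t rval)
{
    switch (op)
    {
        case '+':
            return lval + rval;
        case '-':
            return lval - rval;
        case '*':
            return lval * rval;
        case '/':
        case '%':
            if (rval == 0)
            {
                fprintf(stderr, "division by zero\n");
                exit(1);
            }
            return (op == '/') ? lval / rval : lval % rval;
        default:
            fprintf(stderr, "unknown operator: %c\n", op);
            exit(1);
    }
    return 0;                   /* not reached */
}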


So although I agree that a real expression syntax would be a great "nice 
to have", I do not think that this is a valid objection for blocking this 
patch.


Moreover, upgrading pgbench to handle an actual expression syntax is not 
so trivial, because for instance some parts of the text need to be parsed 
(the set syntax) while others must not be parsed (the SQL commands), so 
some juggling would be needed in the lexer, or maybe the parser should 
only be called on some lines and not others... Everything is possible, 
but this one would require some careful thinking.


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: proposal: ignore null fields in not relation type composite type based constructors

2014-09-10 Thread Pavel Stehule
Hi Stephen

Can I help with something? Are there any open questions?

Regards

Pavel

2014-09-08 6:27 GMT+02:00 Stephen Frost :

> * Pavel Stehule (pavel.steh...@gmail.com) wrote:
> > ignore_nulls in array_to_json_pretty probably is not necessary. On second
> > hand, the cost is zero, and we can have it for API consistency.
>
> I'm willing to be persuaded either way on this, really.  I do think it
> would be nice for both array_to_json and row_to_json to be single
> functions which take default values, but as for if array_to_json has a
> ignore_nulls option, I'm on the fence and would defer to people who use
> that function regularly (I don't today).
>
> Beyond that, I'm pretty happy moving forward with this patch.
>
> Thanks,
>
> Stephen
>


Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-09-10 Thread Rahila Syed
>I will repeat the above tests with high load on CPU and using the benchmark
given by Fujii-san and post the results. 

Average % of CPU usage at the user level for each compression algorithm
is as follows.

Compression    Multiple     Single
Off            81.1338      81.1267
LZ4            81.0998      81.1695
Snappy         80.9741      80.9703
Pglz           81.2353      81.2753

The numbers show that the CPU utilization of Snappy is the lowest. In
decreasing order of CPU utilization:
pglz > No compression > LZ4 > Snappy

The variance of the average CPU utilization numbers is very low. However,
Snappy seems to be the best when it comes to lower CPU utilization.

As per the measurement results posted to date,

LZ4 outperforms Snappy and pglz in terms of compression ratio and
performance. However, the CPU utilization numbers show that Snappy uses
the least CPU, though the difference is small.

As there has been no consensus yet about which compression algorithm to
adopt, is it better to make this decision independent of the FPW compression
patch, as suggested earlier in this thread? FPW compression can be done
using the built-in pglz compression, as it shows considerable performance
gain over uncompressed WAL and a good compression ratio.
Also, the patch to compress multiple blocks at once gives better compression
than single-block compression. ISTM that the performance overhead introduced
by multiple-block compression is slightly higher than for single-block
compression, which can be tested again after modifying the patch to use
pglz. Hence, this patch can be built using multiple-block compression.

Thoughts?



--
View this message in context: 
http://postgresql.1045698.n5.nabble.com/Compression-of-full-page-writes-tp5769039p5818552.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Aussie timezone database changes incoming

2014-09-10 Thread Andrew McNamara
>The Russian changes are perhaps not such a big deal because they've
>done that sort of thing before, but this is an earful:
>
> Australian eastern time zone abbreviations are now AEST/AEDT not
> EST, and similarly for the other Australian zones.  That is, for
> eastern standard and daylight saving time the abbreviations are AEST
> and AEDT instead of the former EST for both; similarly, ACST/ACDT,
> ACWST/ACWDT, and AWST/AWDT are now used instead of the former CST,
> CWST, and WST.  This change does not affect UTC offsets, only time
> zone abbreviations.  (Thanks to Rich Tibbett and many others.)
[...]
>Anyone from down under care to remark about the actual usage of old
>and new abbreviations?

AEST/AEDT/etc are the official abbreviations and are commonly used.
They have been increasingly used over the last 20 years or so, and the
EST/EDT stuff in the Olson tz database has been a source of annoyance
for a very long time, eg:

http://thread.gmane.org/gmane.comp.time.tz/2262

Quite likely this change will break stuff, but my feeling is more people
will be cheering than screaming.

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_basebackup vs. Windows and tablespaces

2014-09-10 Thread Amit Kapila
On Thu, Sep 11, 2014 at 9:10 AM, Dilip kumar  wrote:
>
>
> I have done the testing and the behavior is as expected.
>
> Do we need any documentation changes? I mean, is this limitation on
Windows mentioned anywhere?

I don't think such a limitation is currently mentioned in the docs;
however, I think we can update the docs in the following locations:

1. In the description of pg_start_backup on the page below:
http://www.postgresql.org/docs/devel/static/functions-admin.html#FUNCTIONS-ADMIN-BACKUP
2. In the explanation of base backups, under the heading
"Making a Base Backup Using the Low Level API", on the page below:
http://www.postgresql.org/docs/devel/static/continuous-archiving.html#BACKUP-BASE-BACKUP

In general, we can explain the symlink_label file wherever we
explain the backup_label file.

If you think it is sufficient to explain symlink_label in the
above places, then I can update the patch.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Aussie timezone database changes incoming

2014-09-10 Thread Craig Ringer
On 09/10/2014 11:23 PM, Tom Lane wrote:
> In connection with a question asked today on pgsql-general, I had
> occasion to go check the release announcements for the IANA timezone
> database files, and it turns out that there are some big changes in
> 2014f:
> http://mm.icann.org/pipermail/tz-announce/2014-August/23.html
> 
> The Russian changes are perhaps not such a big deal because they've
> done that sort of thing before, but this is an earful:
> 
>  Australian eastern time zone abbreviations are now AEST/AEDT not
>  EST, and similarly for the other Australian zones.  That is, for
>  eastern standard and daylight saving time the abbreviations are AEST
>  and AEDT instead of the former EST for both; similarly, ACST/ACDT,
>  ACWST/ACWDT, and AWST/AWDT are now used instead of the former CST,
>  CWST, and WST.  This change does not affect UTC offsets, only time
>  zone abbreviations.  (Thanks to Rich Tibbett and many others.)

Oh, lovely.

I shouldn't be surprised that Australia gets to change. While the cynic
in me thinks this is the usual USA-is-the-center-of-the-universe-ism, in
reality it makes sense given relative population and likely impact.

> I'm wondering how many Aussie applications are going to break when
> this goes in, and if we could/should do anything about it.  One idea
> that comes to mind is to create an "Australia_old" tznames file
> containing the current Aussie zone abbreviations, so as to provide
> an easy way to maintain backwards compatibility at need (you'd select
> that as your timezone_abbreviations GUC setting).
> 
> Anyone from down under care to remark about the actual usage of old
> and new abbreviations?

Most systems I see work in UTC, but I don't actually work with many
that're in Australia.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_basebackup vs. Windows and tablespaces

2014-09-10 Thread Dilip kumar
On 20 August 2014 19:49, Amit Kapila Wrote



> There are some comments I would like to share with you
>
>
>
> 1.  Rebase the patch to current GIT head.
>Done.

>> +  initStringInfo(&symlinkfbuf);
>>
>> I think the declaration and initialization of the symlinkfbuf string can 
>> be moved under the #ifdef WIN32 compile-time macro;
>>
>> for other platforms it's simply allocated and freed, which can be avoided.
>Agreed, I have changed the patch as per your suggestion.

I have done the testing and the behavior is as expected.
Do we need any documentation changes? I mean, is this limitation on Windows
mentioned anywhere?
If no change is needed, then I will move the patch to “Ready For Committer”.

Thanks & Regards,
Dilip



Re: [HACKERS] Support for N synchronous standby servers

2014-09-10 Thread Michael Paquier
On Thu, Sep 11, 2014 at 5:21 AM, Heikki Linnakangas
 wrote:
> On 08/28/2014 10:10 AM, Michael Paquier wrote:
>>
>> + #synchronous_standby_num = -1 # number of standbys servers using sync
>> rep
>
>
> To be honest, that's a horrible name for the GUC. Back when synchronous
> replication was implemented, we had looong discussions on this feature. It
> was called "quorum commit" back then. I'd suggest using the "quorum" term in
> this patch, too, that's a fairly well-known term in distributed computing
> for this.
I am open to any suggestions. Then what about the following parameter names?
- synchronous_quorum_num
- synchronous_standby_quorum
- synchronous_standby_quorum_num
- synchronous_quorum_commit

> When synchronous replication was added, quorum was left out to keep things
> simple; the current feature set was the most we could all agree on to be
> useful. If you search the archives for "quorum commit" you'll see what I
> mean. There was a lot of confusion on what is possible and what is useful,
> but regarding this particular patch: people wanted to be able to describe
> more complicated scenarios. For example, imagine that you have a master and
> two standbys in the primary data center, and two more standbys in a
> different data center. It should be possible to specify that you must get
> acknowledgment from at least one standby in both data centers. Maybe you
> could hack that by giving the standbys in the same data center the same
> name, but it gets ugly, and it still won't scale to even more complex
> scenarios.

Currently two nodes can only have the same priority if they have the
same application_name, so we could for example add a new connstring
parameter, say application_group, to define groups of nodes that will
have the same priority (if a node does not define application_group,
it defaults to application_name; if application_name is NULL, well, we
don't care much as it cannot be a sync candidate). That's a first idea
we could use to control groups of nodes. We could then switch
syncrep.c to use application_group in s_s_names instead of
application_name. That would be backward-compatible, and could open
the door for more improvements for quorum commit, as we could control
groups of nodes. Well, this is a super-set of what application_name
can already do, but it still lets you identify single nodes within
the same data center and see how far behind they are in replication,
so I think this would be really user-friendly. A similar idea would
be the base work for the next thing... see below.

Now, in your case the two nodes in the second data center need to have
the same priority either way. With this patch you can achieve that
with the same node name. Where things are not that great with this
patch is a scenario like this, though:
- 5 slaves: 1 with the master (node_local), 2 in a 2nd data center
(node_center1), and 2 more in a 3rd data center (node_center2)
- s_s_num = 3
- s_s_names = 'node_local,node_center1,node_center2'

In this case the nodes have the following priorities:
- node_local => 1
- the 2 nodes with node_center1 => 2
- the 2 nodes with node_center2 => 3
In this {1,2,2,3,3} scheme, the patch makes the system wait for
node_local and the two nodes in node_center1, without caring about the
ones in node_center2, as it will pick up only the nodes with the
highest priority. If the user expects the system to wait for a node in
node_center2, they will be disappointed. That's perhaps where we could
improve things, by adding an extra parameter to control how the
priority ranks are used, say synchronous_priority_check:
- absolute (or individual): wait for the first s_s_num nodes having the
lowest priority values; in this case we'll wait for {1,2,2}
- group: wait for only one node in each of the lowest s_s_num priorities;
here we'll wait for {1,2,3}
Note that we may not even need this parameter if we assume by default
that we wait for only one node in a given group that has the same
priority.
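
To make that concrete, the settings for the scenario above could look like
this (synchronous_priority_check is only the hypothetical parameter floated
here, it is not part of the patch):

# postgresql.conf sketch for the 5-standby example above
synchronous_standby_num = 3
synchronous_standby_names = 'node_local,node_center1,node_center2'
#synchronous_priority_check = group   # hypothetical: wait for {1,2,3}
                                      # instead of the default {1,2,2}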

> Maybe that's OK - we don't necessarily need to solve all scenarios at once.
> But it's worth considering.

Parametrizing this and covering user expectations is tricky. Either
way, not everybody can be happy :) There are even people unhappy now
because we can define only a single sync node.

> BTW, how does this patch behave if there are multiple standbys connected
> with the same name?

All the nodes have the same priority. For example, in the case of a
cluster with 5 slaves having the same application name and s_s_num = 3,
the first three nodes found when scanning the WAL sender array are
expected to return a COMMIT before committing locally:
=# show synchronous_standby_num ;
 synchronous_standby_num
-
 3
(1 row)
=# show synchronous_standby_names ;
 synchronous_standby_names
---
 node
(1 row)
=# SELECT application_name, client_port,
pg_xlog_location_diff(sent_location, flush_location) AS replay_delta,
sync_priority, sync_state
FROM pg_stat_replication ORDER BY replay_delta ASC, app

[HACKERS] about half processes are blocked by btree, btree is bottleneck?

2014-09-10 Thread Xiaoyulei
I use benchmarksql with more than 200 clients on PG 9.3.3. While the test is
running, I collect the stacks of all the processes. I found that about 100
processes are blocked in btree insert, and another 100 are blocked in
xloginsert.

Does btree have bad performance in concurrency scenarios?

Sum:66
#0  0x7f8273a77627 in semop () from /lib64/libc.so.6
#1  0x0060cda7 in PGSemaphoreLock ()
#2  0x006511a9 in LWLockAcquire ()
#3  0x004987f7 in _bt_relandgetbuf ()
#4  0x0049c116 in _bt_search ()
#5  0x00497e13 in _bt_doinsert ()
#6  0x0049af52 in btinsert ()
#7  0x0072dce4 in FunctionCall6Coll ()
#8  0x0049592e in index_insert ()
#9  0x00590ac5 in ExecInsertIndexTuples ()


Sum:36
#0  0x7f8273a77627 in semop () from /lib64/libc.so.6
#1  0x0060cda7 in PGSemaphoreLock ()
#2  0x006511a9 in LWLockAcquire ()
#3  0x00497e31 in _bt_doinsert ()
#4  0x0049af52 in btinsert ()
#5  0x0072dce4 in FunctionCall6Coll ()
#6  0x0049592e in index_insert ()
#7  0x00590ac5 in ExecInsertIndexTuples ()


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Inaccuracy in VACUUM's tuple count estimates

2014-09-10 Thread Bruce Momjian
On Thu, Jun 12, 2014 at 01:40:59PM +0200, Andres Freund wrote:
> Hi Tom,
> 
> On 2014-06-06 15:44:25 -0400, Tom Lane wrote:
> > I figured it'd be easy enough to get a better estimate by adding another
> > counter to count just LIVE and INSERT_IN_PROGRESS tuples (thus effectively
> > assuming that in-progress inserts and deletes will both commit).
> 
> Did you plan to backpatch that? My inclination would be no...
> 
> >  I did
> > that, and found that it helped Tim's test case not at all :-(.  A bit of
> > sleuthing revealed that HeapTupleSatisfiesVacuum actually returns
> > INSERT_IN_PROGRESS for any tuple whose xmin isn't committed, regardless of
> > whether the transaction has since marked it for deletion:
> > 
> > /*
> >  * It'd be possible to discern between INSERT/DELETE in progress
> >  * here by looking at xmax - but that doesn't seem beneficial 
> > for
> >  * the majority of callers and even detrimental for some. We'd
> >  * rather have callers look at/wait for xmin than xmax. It's
> >  * always correct to return INSERT_IN_PROGRESS because that's
> >  * what's happening from the view of other backends.
> >  */
> > return HEAPTUPLE_INSERT_IN_PROGRESS;
> > 
> > It did not use to blow this question off: back around 8.3 you got
> > DELETE_IN_PROGRESS if the tuple had a delete pending.  I think we need
> > less laziness + fuzzy thinking here.  Maybe we should have a separate
> > HEAPTUPLE_INSERT_AND_DELETE_IN_PROGRESS result code?  Is it *really*
> > the case that callers other than VACUUM itself are okay with failing
> > to make this distinction?  I'm dubious: there are very few if any
> > callers that treat the INSERT and DELETE cases exactly alike.
> 
> My current position on this is that we should leave the code as is <9.4
> and HEAPTUPLE_INSERT_IN_PROGRESS for the 9.4/master. Would you be ok
> with that? The second best thing imo would be to discern and return
> HEAPTUPLE_INSERT_IN_PROGRESS/HEAPTUPLE_DELETE_IN_PROGRESS for the
> respective cases.
> Which way would you like to go?

Did we ever come to a conclusion on this?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] "cancelling statement due to user request error" occurs but the transaction has committed.

2014-09-10 Thread Bruce Momjian
On Tue, Jun 10, 2014 at 10:30:24AM -0400, Robert Haas wrote:
> On Tue, Jun 10, 2014 at 10:18 AM, Tom Lane  wrote:
> > Robert Haas  writes:
> >> I don't agree with this analysis.  If the connection is closed after
> >> the client sends a COMMIT and before it gets a response, then the
> >> client must indeed be smart enough to figure out whether or not the
> >> commit happened.  But if the server sends a response, the client
> >> should be able to rely on that response being correct.  In this case,
> >> an ERROR is getting sent but the transaction is getting committed;
> >> yuck.  I'm not sure whether the fix is right, but this definitely
> >> seems like a bug.
> >
> > In general, the only way to avoid that sort of behavior for a post-commit
> > error would be to PANIC ... and even then, the transaction got committed,
> > which might not be the expectation of a client that got an error message,
> > even if it said PANIC.  So this whole area is a minefield, and the only
> > attractive thing we can do is to try to reduce the number of errors that
> > can get thrown post-commit.  We already, for example, do not treat
> > post-commit file unlink failures as ERROR, though we surely would prefer
> > to do that.
> 
> We could treat it as a lost-communication scenario.  The appropriate
> recovery actions from the client's point of view are identical.
> 
> > So from this standpoint, redefining SIGINT as not throwing an error when
> > we're in post-commit seems like a good idea.  I'm not endorsing any
> > details of the patch here, but the 2-foot view seems generally sound.
> 
> Cool, that makes sense to me also.

Did we ever do anything about this?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_control is missing a field for LOBLKSIZE

2014-09-10 Thread Bruce Momjian
On Tue, Jun 17, 2014 at 08:46:00PM -0400, Bruce Momjian wrote:
> On Tue, Jun 17, 2014 at 07:12:16PM -0400, Tom Lane wrote:
> > Bruce Momjian  writes:
> > > On Tue, Jun 17, 2014 at 03:55:02PM -0400, Alvaro Herrera wrote:
> > >> Can't you compare it to the historic default value?  I mean, add an
> > >> assumption that people thus far have never tweaked it.
> > 
> > > Well, if they did tweak it, then they would be unable to use pg_upgrade
> > > because it would complain about a mismatch if they actually matched the
> > > old and new servers.
> > 
> > What about comparing to the symbolic value LOBLKSIZE?  This would make
> > pg_upgrade assume that the old installation had been tweaked the same
> > as in its own build.  This ends up being the same as what you said,
> > ie, effectively no comparison ... but it might be less complicated to
> > code/understand.
> 
> OK, assume the compiled-in default is the value for an old cluster that
> has no value --- yeah, I could do that.

Turns out I already had values that could be missing in the old cluster,
so I just used the same format for this, rather than testing for
LOBLKSIZE.

Attached patch applied.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +
diff --git a/contrib/pg_upgrade/controldata.c b/contrib/pg_upgrade/controldata.c
new file mode 100644
index 9282b8e..d105a59
*** a/contrib/pg_upgrade/controldata.c
--- b/contrib/pg_upgrade/controldata.c
*** get_control_data(ClusterInfo *cluster, b
*** 54,59 
--- 54,60 
  	bool		got_ident = false;
  	bool		got_index = false;
  	bool		got_toast = false;
+ 	bool		got_large_object = false;
  	bool		got_date_is_int = false;
  	bool		got_float8_pass_by_value = false;
  	bool		got_data_checksum_version = false;
*** get_control_data(ClusterInfo *cluster, b
*** 357,362 
--- 358,374 
  			cluster->controldata.toast = str2uint(p);
  			got_toast = true;
  		}
+ 		else if ((p = strstr(bufin, "Size of a large-object chunk:")) != NULL)
+ 		{
+ 			p = strchr(p, ':');
+ 
+ 			if (p == NULL || strlen(p) <= 1)
+ pg_fatal("%d: controldata retrieval problem\n", __LINE__);
+ 
+ 			p++;/* removing ':' char */
+ 			cluster->controldata.large_object = str2uint(p);
+ 			got_large_object = true;
+ 		}
  		else if ((p = strstr(bufin, "Date/time type storage:")) != NULL)
  		{
  			p = strchr(p, ':');
*** get_control_data(ClusterInfo *cluster, b
*** 475,480 
--- 487,494 
  		!got_tli ||
  		!got_align || !got_blocksz || !got_largesz || !got_walsz ||
  		!got_walseg || !got_ident || !got_index || !got_toast ||
+ 		(!got_large_object &&
+ 		 cluster->controldata.cat_ver >= LARGE_OBJECT_SIZE_PG_CONTROL_VER) ||
  		!got_date_is_int || !got_float8_pass_by_value || !got_data_checksum_version)
  	{
  		pg_log(PG_REPORT,
*** get_control_data(ClusterInfo *cluster, b
*** 527,532 
--- 541,550 
  		if (!got_toast)
  			pg_log(PG_REPORT, "  maximum TOAST chunk size\n");
  
+ 		if (!got_large_object &&
+ 			cluster->controldata.cat_ver >= LARGE_OBJECT_SIZE_PG_CONTROL_VER)
+ 			pg_log(PG_REPORT, "  large-object chunk size\n");
+ 
  		if (!got_date_is_int)
  			pg_log(PG_REPORT, "  dates/times are integers?\n");
  
*** check_control_data(ControlData *oldctrl,
*** 576,581 
--- 594,602 
  	if (oldctrl->toast == 0 || oldctrl->toast != newctrl->toast)
  		pg_fatal("old and new pg_controldata maximum TOAST chunk sizes are invalid or do not match\n");
  
+ 	if (oldctrl->large_object == 0 || oldctrl->large_object != newctrl->large_object)
+ 		pg_fatal("old and new pg_controldata large-object chunk sizes are invalid or do not match\n");
+ 
  	if (oldctrl->date_is_int != newctrl->date_is_int)
  		pg_fatal("old and new pg_controldata date/time storage types do not match\n");
  
diff --git a/contrib/pg_upgrade/pg_upgrade.h b/contrib/pg_upgrade/pg_upgrade.h
new file mode 100644
index 1ac3394..0207391
*** a/contrib/pg_upgrade/pg_upgrade.h
--- b/contrib/pg_upgrade/pg_upgrade.h
*** extern char *output_files[];
*** 116,121 
--- 116,127 
  #define MULTIXACT_FORMATCHANGE_CAT_VER 201301231
  
  /*
+  * large object chunk size added to pg_controldata,
+  * commit 5f93c37805e7485488480916b4585e098d3cc883
+  */
+ #define LARGE_OBJECT_SIZE_PG_CONTROL_VER 942
+ 
+ /*
   * Each relation is represented by a relinfo structure.
   */
  typedef struct
*** typedef struct
*** 203,208 
--- 209,215 
  	uint32		ident;
  	uint32		index;
  	uint32		toast;
+ 	uint32		large_object;
  	bool		date_is_int;
  	bool		float8_pass_by_value;
  	bool		data_checksum_version;

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_basebackup failed to back up large file

2014-09-10 Thread Bruce Momjian
On Mon, Jun  9, 2014 at 11:17:41AM +0200, Magnus Hagander wrote:
> On Wednesday, June 4, 2014, Tom Lane  wrote:
> 
> Magnus Hagander  writes:
> > On Tue, Jun 3, 2014 at 6:38 PM, Tom Lane  wrote:
> >> Another thought is we could make pg_basebackup simply skip any files
> that
> >> exceed RELSEG_SIZE, on the principle that you don't really need/want
> >> enormous log files to get copied anyhow.  We'd still need the pax
> >> extension if the user had configured large RELSEG_SIZE, but having a
> >> compatible tar could be documented as a requirement of doing that.
> 
> > I think going all the way to pax is the proper long-term thing to do, at
> > least if we can confirm it works in the main tar implementations.
> 
> > For backpatchable that seems more reasonable. It doesn't work today, and
> we
> > just need to document that it doesn't, with larger RELSEG_SIZE. And then
> > fix it properly for the future.
> 
> Agreed, this would be a reasonable quick fix that we could replace in
> 9.5 or later.
> 
> 
> 
> Fujii, are you going to be able to work on this with the now expanded scope? 
> :)

Is this a TODO or doc item?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [BUGS] BUG #10823: Better REINDEX syntax.

2014-09-10 Thread Stephen Frost
* Vik Fearing (vik.fear...@dalibo.com) wrote:
> On 09/08/2014 06:17 AM, Stephen Frost wrote:
> > * Vik Fearing (vik.fear...@dalibo.com) wrote:
> >> On 09/02/2014 10:17 PM, Marko Tiikkaja wrote:
> >>> Yeah, I think I like this better than allowing all of them without the
> >>> database name.
> >>
> >> Why?  It's just a noise word!
> > 
> > Eh, because it ends up reindexing system tables too, which is probably
> > not what new folks are expecting.
> 
> No behavior is changed at all.  REINDEX DATABASE dbname; has always hit
> the system tables.  Since dbname can *only* be the current database,
> there's no logic nor benefit in requiring it to be specified.

Sure, but I think the point is that reindexing the system tables as part
of a database-wide reindex is a *bad* thing which we shouldn't be
encouraging by making it easier.

I realize you're a bit 'stuck' here because we don't like the current
behavior, but we don't want to change it either.

> > Also, it's not required when you say
> > 'user tables', so it's similar to your user_tables v1 patch in that
> > regard.
> 
> The fact that REINDEX USER TABLES; is the only one that doesn't require
> the dbname seems very inconsistent and confusing.

I understand, but the alternative would be a 'reindex;' which *doesn't*
reindex the system tables - would that be less confusing?  Or getting rid
of the current 'reindex database' which also reindexes system tables...

> >> Yes, I will update the patch.
> > 
> > Still planning to do this..?
> > 
> > Marking this back to waiting-for-author.
> 
> Yes, but probably not for this commitfest unfortunately.

Fair enough, I'll mark it 'returned with feedback'.

Thanks!

Stephen




Re: [HACKERS] [BUGS] BUG #10823: Better REINDEX syntax.

2014-09-10 Thread Vik Fearing
On 09/08/2014 06:17 AM, Stephen Frost wrote:
> * Vik Fearing (vik.fear...@dalibo.com) wrote:
>> On 09/02/2014 10:17 PM, Marko Tiikkaja wrote:
>>> Yeah, I think I like this better than allowing all of them without the
>>> database name.
>>
>> Why?  It's just a noise word!
> 
> Eh, because it ends up reindexing system tables too, which is probably
> not what new folks are expecting.

No behavior is changed at all.  REINDEX DATABASE dbname; has always hit
the system tables.  Since dbname can *only* be the current database,
there's no logic nor benefit in requiring it to be specified.

> Also, it's not required when you say
> 'user tables', so it's similar to your user_tables v1 patch in that
> regard.

The fact that REINDEX USER TABLES; is the only one that doesn't require
the dbname seems very inconsistent and confusing.

>> Yes, I will update the patch.
> 
> Still planning to do this..?
> 
> Marking this back to waiting-for-author.

Yes, but probably not for this commitfest unfortunately.
-- 
Vik


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Need Multixact Freezing Docs

2014-09-10 Thread Bruce Momjian
On Fri, Sep  5, 2014 at 07:39:36PM -0400, Bruce Momjian wrote:
> On Wed, Sep  3, 2014 at 05:17:17PM -0400, Robert Haas wrote:
> > > I had a look at this and came upon a problem --- there is no multi-xid
> > > SQL data type, and in fact the system catalogs that store mxid values
> > > use xid, e.g.:
> > >
> > >  relminmxid | xid   | not null
> > >
> > > With no mxid data type, there is no way to do function overloading to
> > > cause age to call the mxid variant.
> > >
> > > Should we use an explicit mxid_age() function name?  Add an mxid data
> > > type?
> > 
> > Maybe both.  But mxid_age() is probably the simpler way forward just to 
> > start.
> 
> OK, patch applied using mxid_age() and no new data type.

Applied.
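
For reference, checking multixact age should now be as simple as something
like this (assuming mxid_age() is applied directly to the xid-typed
relminmxid column, mirroring age(xid)):

SELECT relname, mxid_age(relminmxid) AS mxid_age
FROM pg_class
WHERE relkind = 'r'
ORDER BY 2 DESC
LIMIT 10;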

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-09-10 Thread Tomas Vondra
On 10.9.2014 21:34, Robert Haas wrote:
> On Wed, Sep 10, 2014 at 3:12 PM, Tomas Vondra  wrote:
>> On 10.9.2014 20:25, Heikki Linnakangas wrote:
>>> On 09/10/2014 01:49 AM, Tomas Vondra wrote:
 I also did a few 'minor' changes to the dense allocation patch, most
 notably:

 * renamed HashChunk/HashChunkData to MemoryChunk/MemoryChunkData
The original naming seemed a bit awkward.
>>>
>>> That's too easy to confuse with regular MemoryContext and AllocChunk
>>> stuff. I renamed it to HashMemoryChunk.
>>
>> BTW this breaks the second patch, which is allocating new chunks when
>> resizing the hash table. Should I rebase the patch, or can the commiter
>> do s/MemoryChunk/HashMemoryChunk/ ?
>>
>> Assuming there are no more fixes needed, of course.
> 
> Rebasing it will save the committer time, which will increase the
> chances that one will look at your patch.  So it's highly recommended.

OK. So here's v13 of the patch, reflecting this change.

regards
Tomas
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 781a736..d128e08 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1900,18 +1900,21 @@ show_hash_info(HashState *hashstate, ExplainState *es)
 		if (es->format != EXPLAIN_FORMAT_TEXT)
 		{
 			ExplainPropertyLong("Hash Buckets", hashtable->nbuckets, es);
+			ExplainPropertyLong("Original Hash Buckets",
+hashtable->nbuckets_original, es);
 			ExplainPropertyLong("Hash Batches", hashtable->nbatch, es);
 			ExplainPropertyLong("Original Hash Batches",
 hashtable->nbatch_original, es);
 			ExplainPropertyLong("Peak Memory Usage", spacePeakKb, es);
 		}
-		else if (hashtable->nbatch_original != hashtable->nbatch)
+		else if ((hashtable->nbatch_original != hashtable->nbatch) ||
+ (hashtable->nbuckets_original != hashtable->nbuckets))
 		{
 			appendStringInfoSpaces(es->str, es->indent * 2);
 			appendStringInfo(es->str,
-			"Buckets: %d  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
-			 hashtable->nbuckets, hashtable->nbatch,
-			 hashtable->nbatch_original, spacePeakKb);
+			"Buckets: %d (originally %d)  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
+			 hashtable->nbuckets, hashtable->nbuckets_original,
+			 hashtable->nbatch, hashtable->nbatch_original, spacePeakKb);
 		}
 		else
 		{
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 3ef7cfb..4f00426 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -39,6 +39,7 @@
 
 
 static void ExecHashIncreaseNumBatches(HashJoinTable hashtable);
+static void ExecHashIncreaseNumBuckets(HashJoinTable hashtable);
 static void ExecHashBuildSkewHash(HashJoinTable hashtable, Hash *node,
 	  int mcvsToUse);
 static void ExecHashSkewTableInsert(HashJoinTable hashtable,
@@ -49,6 +50,9 @@ static void ExecHashRemoveNextSkewBucket(HashJoinTable hashtable);
 
 static void *dense_alloc(HashJoinTable hashtable, Size size);
 
+/* Memory needed for optimal number of buckets. */
+#define BUCKETS_SPACE(htab)	((htab)->nbuckets_optimal * sizeof(HashJoinTuple))
+
 /* 
  *		ExecHash
  *
@@ -117,6 +121,7 @@ MultiExecHash(HashState *node)
 /* It's a skew tuple, so put it into that hash table */
 ExecHashSkewTableInsert(hashtable, slot, hashvalue,
 		bucketNumber);
+hashtable->skewTuples += 1;
 			}
 			else
 			{
@@ -127,6 +132,25 @@ MultiExecHash(HashState *node)
 		}
 	}
 
+	/* resize the hash table if needed (NTUP_PER_BUCKET exceeded) */
+	if (hashtable->nbuckets != hashtable->nbuckets_optimal)
+	{
+		/* We never decrease the number of buckets. */
+		Assert(hashtable->nbuckets_optimal > hashtable->nbuckets);
+
+#ifdef HJDEBUG
+		printf("Increasing nbuckets %d => %d\n",
+			   hashtable->nbuckets, hashtable->nbuckets_optimal);
+#endif
+
+		ExecHashIncreaseNumBuckets(hashtable);
+	}
+
+	/* Account for the buckets in spaceUsed (reported in EXPLAIN ANALYZE) */
+	hashtable->spaceUsed += BUCKETS_SPACE(hashtable);
+	if (hashtable->spaceUsed > hashtable->spacePeak)
+			hashtable->spacePeak = hashtable->spaceUsed;
+
 	/* must provide our own instrumentation support */
 	if (node->ps.instrument)
 		InstrStopNode(node->ps.instrument, hashtable->totalTuples);
@@ -272,7 +296,10 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	 */
 	hashtable = (HashJoinTable) palloc(sizeof(HashJoinTableData));
 	hashtable->nbuckets = nbuckets;
+	hashtable->nbuckets_original = nbuckets;
+	hashtable->nbuckets_optimal = nbuckets;
 	hashtable->log2_nbuckets = log2_nbuckets;
+	hashtable->log2_nbuckets_optimal = log2_nbuckets;
 	hashtable->buckets = NULL;
 	hashtable->keepNulls = keepNulls;
 	hashtable->skewEnabled = false;
@@ -286,6 +313,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	hashtable->nbatch_outstart = nbatch;
 	hashtable->growEnabled = true;
 	hashtable->totalTuples = 0;

Re: [HACKERS] Memory Alignment in Postgres

2014-09-10 Thread Robert Haas
On Wed, Sep 10, 2014 at 4:29 PM, Bruce Momjian  wrote:
> On Wed, Sep 10, 2014 at 11:43:52AM -0400, Robert Haas wrote:
>> But there are a couple of obvious problems with this idea, too, such as:
>>
>> 1. It's really complicated and a ton of work.
>> 2. It would break pg_upgrade pretty darn badly unless we employed some
>> even-more-complex strategy to mitigate that.
>> 3. The savings might not be enough to justify the effort.
>>
>> It might be interesting for someone to develop a tool measuring the
>> number of bytes of alignment padding we lose per tuple or per page and
>> gather some statistics on it on various databases.  That would give us
>> some sense as to the possible savings.
>
> And will we ever implement a logical attribute system so we can reorder
> the stored attributes to minimize wasted space?

You forgot to attach the patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_background (and more parallelism infrastructure patches)

2014-09-10 Thread Robert Haas
On Wed, Sep 10, 2014 at 4:01 PM, Robert Haas  wrote:
>> Yes, although my issue with the hooks was not that you only provided them
>> for 2 functions, but the fact that it had no structure and the
>> implementation was "if hook set do this, else do that" which I don't see
>> as a good way of doing it.
>
> We've done it that way in a bunch of other places, like ExecutorRun().
> An advantage of this approach (I think) is that jumping to a fixed
> address is faster than jumping through a function pointer - so with
> the approach I've taken here, the common case where we're talking to
> the client incurs only the overhead of a null-test, and the larger
> overhead of the function pointer jump is incurred only when the hook
> is in use.  Maybe that's not enough of a difference to matter to
> anything, but I think the contention that I've invented some novel
> kind of interface here doesn't hold up to scrutiny.  We have lots of
> hooks that work just like what I did here.

Here's what the other approach looks like.  I can't really see doing
this way and then only providing hooks for those two functions, so
this is with hooks for all the send-side stuff.

Original version: 9 files changed, 295 insertions(+), 3 deletions(-)
This version: 9 files changed, 428 insertions(+), 47 deletions(-)

There is admittedly a certain elegance to providing a complete set of
hooks, so maybe this is the way to go.  The remaining patches in the
patch series work with either the old version or this one; the changes
here don't affect anything else.

Anyone else have an opinion on which way is better here?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
commit 76b9b171f81b1639e9d0379f5066212107a740f6
Author: Robert Haas 
Date:   Wed May 28 19:30:21 2014 -0400

Support frontend-backend protocol communication using a shm_mq.

A background worker can use pq_redirect_to_shm_mq() to direct protocol
that would normally be sent to the frontend to a shm_mq so that another
process may read them.

The receiving process may use pq_parse_errornotice() to parse an
ErrorResponse or NoticeResponse from the background worker and, if
it wishes, ThrowErrorData() to propagate the error (with or without
further modification).

V2: Lots more hooks, and a struct!

diff --git a/src/backend/libpq/Makefile b/src/backend/libpq/Makefile
index 8be0572..09410c4 100644
--- a/src/backend/libpq/Makefile
+++ b/src/backend/libpq/Makefile
@@ -15,7 +15,7 @@ include $(top_builddir)/src/Makefile.global
 # be-fsstubs is here for historical reasons, probably belongs elsewhere
 
 OBJS = be-fsstubs.o be-secure.o auth.o crypt.o hba.o ip.o md5.o pqcomm.o \
-   pqformat.o pqsignal.o
+   pqformat.o pqmq.o pqsignal.o
 
 ifeq ($(with_openssl),yes)
 OBJS += be-secure-openssl.o
diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index 605d891..7c4252d 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -102,7 +102,6 @@
 int			Unix_socket_permissions;
 char	   *Unix_socket_group;
 
-
 /* Where the Unix socket files are (list of palloc'd strings) */
 static List *sock_paths = NIL;
 
@@ -134,16 +133,38 @@ static bool DoingCopyOut;
 
 
 /* Internal functions */
-static void pq_close(int code, Datum arg);
+static void socket_comm_reset(void);
+static void socket_close(int code, Datum arg);
+static void socket_set_nonblocking(bool nonblocking);
+static int	socket_flush(void);
+static int	socket_flush_if_writable(void);
+static bool socket_is_send_pending(void);
+static int	socket_putmessage(char msgtype, const char *s, size_t len);
+static void socket_putmessage_noblock(char msgtype, const char *s, size_t len);
+static void socket_startcopyout(void);
+static void socket_endcopyout(bool errorAbort);
 static int	internal_putbytes(const char *s, size_t len);
 static int	internal_flush(void);
-static void pq_set_nonblocking(bool nonblocking);
+static void socket_set_nonblocking(bool nonblocking);
 
 #ifdef HAVE_UNIX_SOCKETS
 static int	Lock_AF_UNIX(char *unixSocketDir, char *unixSocketPath);
 static int	Setup_AF_UNIX(char *sock_path);
 #endif   /* HAVE_UNIX_SOCKETS */
 
+PQsendMethods PQsendSocketMethods;
+
+static PQsendMethods PqSendSocketMethods = {
+	socket_comm_reset,
+	socket_flush,
+	socket_flush_if_writable,
+	socket_is_send_pending,
+	socket_putmessage,
+	socket_putmessage_noblock,
+	socket_startcopyout,
+	socket_endcopyout
+};
+
 
 /* 
  *		pq_init - initialize libpq at backend startup
@@ -152,24 +173,25 @@ static int	Setup_AF_UNIX(char *sock_path);
 void
 pq_init(void)
 {
+	PqSendMethods = &PqSendSocketMethods;
 	PqSendBufferSize = PQ_SEND_BUFFER_SIZE;
 	PqSendBuffer = MemoryContextAlloc(TopMemoryContext, PqSendBufferSize);
 	PqSendPointer = PqSendStart = PqRecvPointer = PqRecvLength = 0;
 	PqCommBusy = false;
 	DoingCopyOut = false;
-	on_proc_exit(pq_close, 0);
+	on_proc_exit(socket_close, 0);
 }
 
 /* -

[HACKERS] Commitfest status

2014-09-10 Thread Heikki Linnakangas
Another commitfest week has passed, and here we are still. There are now 24 
patches in "Needs Review" state, and 8 in "Ready for Committer". I'm not 
paying attention to the "Waiting on Author" patches - once we're close 
to zero on the other patches, those will be bounced back.


The good news is that all but two patches have a reviewer assigned. The 
bad news is that the rest don't seem to be graduating from the Needs 
Review state.


Reviewers: please review your patches. And then pick another patch to 
review; one of the two that have no reviewer assigned yet, or some other 
patch. It is only a good thing to have more than one reviewer for the same patch.


Patch authors: if your patch is not getting reviewed in a timely 
fashion, then in a few days please send an off-list note to the reviewer and 
ask what the status is. The reviewer might not realize that you're waiting.

- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SKIP LOCKED DATA (work in progress)

2014-09-10 Thread Alvaro Herrera
Robert Haas wrote:
> On Wed, Sep 10, 2014 at 9:47 AM, Alvaro Herrera
>  wrote:
> > I attach some additional minor suggestions to your patch.
> 
> I don't see any attachment.

Uh.  Here it is.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 51d1889..2b336b0 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -4090,7 +4090,7 @@ get_mxact_status_for_lock(LockTupleMode mode, bool is_update)
  *	cid: current command ID (used for visibility test, and stored into
  *		tuple's cmax if lock is successful)
  *	mode: indicates if shared or exclusive tuple lock is desired
- *	wait_policy: whether to block, ereport or skip if lock not available
+ *	wait_policy: what to do if tuple lock is not available
  *	follow_updates: if true, follow the update chain to also lock descendant
  *		tuples.
  *
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index e4065c2..a718c0b 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -1928,6 +1928,9 @@ EvalPlanQual(EState *estate, EPQState *epqstate,
  *
  * Returns a palloc'd copy of the newest tuple version, or NULL if we find
  * that there is no newest version (ie, the row was deleted not updated).
+ * We also return NULL if the tuple is locked and the wait policy is to skip
+ * such tuples.
+ *
  * If successful, we have locked the newest tuple version, so caller does not
  * need to worry about it changing anymore.
  *
@@ -1994,7 +1997,7 @@ EvalPlanQualFetch(EState *estate, Relation relation, int lockmode,
 		break;
 	case LockWaitSkip:
 		if (!ConditionalXactLockTableWait(SnapshotDirty.xmax))
-			return NULL; /* skipping instead of waiting */
+			return NULL; /* skip instead of waiting */
 		break;
 	case LockWaitError:
 		if (!ConditionalXactLockTableWait(SnapshotDirty.xmax))
@@ -2076,6 +2079,10 @@ EvalPlanQualFetch(EState *estate, Relation relation, int lockmode,
 	/* tuple was deleted, so give up */
 	return NULL;
 
+case HeapTupleWouldBlock:
+	elog(WARNING, "this is an odd case");
+	return NULL;
+
 default:
 	ReleaseBuffer(buffer);
 	elog(ERROR, "unrecognized heap_lock_tuple status: %u",
diff --git a/src/backend/executor/nodeLockRows.c b/src/backend/executor/nodeLockRows.c
index f9feff4..990240b 100644
--- a/src/backend/executor/nodeLockRows.c
+++ b/src/backend/executor/nodeLockRows.c
@@ -179,7 +179,10 @@ lnext:
 
 if (copyTuple == NULL)
 {
-	/* Tuple was deleted or skipped (in SKIP LOCKED), so don't return it */
+	/*
+	 * Tuple was deleted; or it's locked and we're under SKIP
+	 * LOCKED policy, so don't return it
+	 */
 	goto lnext;
 }
 /* remember the actually locked tuple's TID */
diff --git a/src/backend/parser/analyze.c b/src/backend/parser/analyze.c
index 0f6e9b0..1f66928 100644
--- a/src/backend/parser/analyze.c
+++ b/src/backend/parser/analyze.c
@@ -2515,11 +2515,12 @@ applyLockingClause(Query *qry, Index rtindex,
 		 * a shared and exclusive lock at the same time; it'll end up being
 		 * exclusive anyway.)
 		 *
-		 * We also consider that NOWAIT wins if it is specified multiple ways,
-		 * otherwise SKIP LOCKED wins. This is a bit more debatable but
-		 * raising an error doesn't seem helpful.  (Consider for instance
-		 * SELECT FOR UPDATE NOWAIT from a view that internally contains a
-		 * plain FOR UPDATE spec.)
+		 * Similarly, if the same RTE is specified with more than one lock wait
+		 * policy, consider that NOWAIT wins over SKIP LOCKED, which in turn
+		 * wins over waiting for the lock (the default).  This is a bit more
+		 * debatable but raising an error doesn't seem helpful.  (Consider for
+		 * instance SELECT FOR UPDATE NOWAIT from a view that internally
+		 * contains a plain FOR UPDATE spec.)
 		 *
 		 * And of course pushedDown becomes false if any clause is explicit.
 		 */
diff --git a/src/include/utils/lockwaitpolicy.h b/src/include/utils/lockwaitpolicy.h
index f8ad07e..7f9a693 100644
--- a/src/include/utils/lockwaitpolicy.h
+++ b/src/include/utils/lockwaitpolicy.h
@@ -1,34 +1,31 @@
 /*-
- *
  * lockwaitpolicy.h
- *	  Header file for the enum LockWaitPolicy enum, which is needed in
- *	  several modules (parser, planner, executor, heap manager).
+ *	  Header file for LockWaitPolicy enum.
  *
  * Copyright (c) 2014, PostgreSQL Global Development Group
  *
  * src/include/utils/lockwaitpolicy.h
- *
  *-
  */
 #ifndef LOCKWAITPOLICY_H
 #define LOCKWAITPOLICY_H
 
 /*
- * Policy for what to do when a row lock cannot be obtained immediately.
- *
- * The enum values defined here control how the parser treats multiple FOR
- * UPDATE/SHARE clauses that a

Re: [HACKERS] Memory Alignment in Postgres

2014-09-10 Thread Bruce Momjian
On Wed, Sep 10, 2014 at 11:43:52AM -0400, Robert Haas wrote:
> But there are a couple of obvious problems with this idea, too, such as:
> 
> 1. It's really complicated and a ton of work.
> 2. It would break pg_upgrade pretty darn badly unless we employed some
> even-more-complex strategy to mitigate that.
> 3. The savings might not be enough to justify the effort.
> 
> It might be interesting for someone to develop a tool measuring the
> number of bytes of alignment padding we lose per tuple or per page and
> gather some statistics on it on various databases.  That would give us
> some sense as to the possible savings.

And will we ever implement a logical attribute system so we can reorder
the stored attributes to minimize wasted space?
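
Not that tool, but as a starting point the per-column alignment requirements
are at least visible in the catalogs; a rough sketch ('some_table' is a
placeholder):

SELECT a.attnum, a.attname, t.typname, t.typlen, t.typalign
FROM pg_attribute a
JOIN pg_type t ON t.oid = a.atttypid
WHERE a.attrelid = 'some_table'::regclass
  AND a.attnum > 0
  AND NOT a.attisdropped
ORDER BY a.attnum;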

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Support for N synchronous standby servers

2014-09-10 Thread Heikki Linnakangas

On 08/28/2014 10:10 AM, Michael Paquier wrote:

+ #synchronous_standby_num = -1 # number of standbys servers using sync rep


To be honest, that's a horrible name for the GUC. Back when synchronous 
replication was implemented, we had looong discussions on this feature. 
It was called "quorum commit" back then. I'd suggest using the "quorum" 
term in this patch, too, that's a fairly well-known term in distributed 
computing for this.


When synchronous replication was added, quorum was left out to keep 
things simple; the current feature set was the most we could all agree 
on to be useful. If you search the archives for "quorum commit" you'll 
see what I mean. There was a lot of confusion on what is possible and 
what is useful, but regarding this particular patch: people wanted to be 
able to describe more complicated scenarios. For example, imagine that 
you have a master and two standbys in the primary data center, and 
two more standbys in a different data center. It should be possible to 
specify that you must get acknowledgment from at least one standby in 
both data centers. Maybe you could hack that by giving the standbys in 
the same data center the same name, but it gets ugly, and it still won't 
scale to even more complex scenarios.


Maybe that's OK - we don't necessarily need to solve all scenarios at 
once. But it's worth considering.


BTW, how does this patch behave if there are multiple standbys connected 
with the same name?


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_background (and more parallelism infrastructure patches)

2014-09-10 Thread Robert Haas
On Tue, Sep 9, 2014 at 6:03 PM, Petr Jelinek  wrote:
>> - The remaining functions are pq_init(), pq_comm_reset(), pq_flush(),
>> pq_flush_if_writable(), pq_is_send_pending(), pq_putmessage(),
>> pq_putmessage_noblock(), pq_startcopyout(), and pq_endcopyout().
>> These are the ones that I think are potentially interesting.
>>
>> I didn't choose to provide hooks for all of these in the submitted
>> patch because they're not all needed for I want to do here:
>> pq_startcopyout() and pq_endcopyout() are only needed for V2 protocol
>> support, which did not interest me (nor did COPY, really);
>> pq_putmessage_noblock(), pq_flush_if_writable(), and
>> pq_is_send_pending() are only used for the walsender protocol, which
>> doesn't seem useful to redirect to a non-socket; and I just didn't
>> happen to have any use for pq_init() or pq_comm_reset().  Hence what I
>> ended up with.
>>
>> But, I could revisit that.  Suppose I provide a structure with 10
>> function pointers for all ten of those functions, or maybe 9 since
>> pq_init() is called so early that it's not likely we'll have control
>> to put the hooks in place before that point, and anyway whatever code
>> installs the hooks can do its own initialization then.  Then make a
>> global variable like pqSendMethods and #define pq_comm_reset() to be
>> pqSendMethods->comm_reset(), pflush() to be pqSendMethods->flush(),
>> and so on for all 9 or 10 methods.  Then the pqmq code could just
>> change pqSendMethods to point to its own method structure instead of
>> the default one.  Would that address the concern this concern?  It's
>> more than 20 lines of code, but it's not TOO bad.
>>
>
> Yes, although my issue with the hooks was not that you only provided them
> for 2 functions, but the fact that it had no structure and the
> implementation was "if hook set do this, else do that" which I don't see
> as a good way of doing it.

We've done it that way in a bunch of other places, like ExecutorRun().
An advantage of this approach (I think) is that jumping to a fixed
address is faster than jumping through a function pointer - so with
the approach I've taken here, the common case where we're talking to
the client incurs only the overhead of a null-test, and the larger
overhead of the function pointer jump is incurred only when the hook
is in use.  Maybe that's not enough of a difference to matter to
anything, but I think the contention that I've invented some novel
kind of interface here doesn't hold up to scrutiny.  We have lots of
hooks that work just like what I did here.
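
Roughly, the difference between the two styles looks like this (illustrative
names only, not what is actually in the patch):

#include <stdio.h>

/* Style 1: a bare hook pointer, with a null-test in the hot path. */
typedef void (*flush_hook_type) (void);
static flush_hook_type flush_hook = NULL;

static void
socket_flush(void)
{
    /* pretend this writes buffered output to the client socket */
    printf("flushing to socket\n");
}

static void
pq_flush_style1(void)
{
    if (flush_hook)
        flush_hook();       /* redirected, e.g. to a shm_mq */
    else
        socket_flush();     /* common case: fixed jump plus one null-test */
}

/* Style 2: a struct of methods, every call goes through a pointer. */
typedef struct
{
    void        (*flush) (void);
    /* ... comm_reset, putmessage, etc. would go here ... */
} SendMethods;

static const SendMethods socket_send_methods = {socket_flush};
static const SendMethods *send_methods = &socket_send_methods;

#define pq_flush_style2()  (send_methods->flush())

int
main(void)
{
    pq_flush_style1();      /* no hook installed, goes to the socket */
    pq_flush_style2();      /* always through the method struct */
    return 0;
}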

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SKIP LOCKED DATA (work in progress)

2014-09-10 Thread Robert Haas
On Wed, Sep 10, 2014 at 9:47 AM, Alvaro Herrera
 wrote:
> I attach some additional minor suggestions to your patch.

I don't see any attachment.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-09-10 Thread Robert Haas
On Wed, Sep 10, 2014 at 3:12 PM, Tomas Vondra  wrote:
> On 10.9.2014 20:25, Heikki Linnakangas wrote:
>> On 09/10/2014 01:49 AM, Tomas Vondra wrote:
>>> I also did a few 'minor' changes to the dense allocation patch, most
>>> notably:
>>>
>>> * renamed HashChunk/HashChunkData to MemoryChunk/MemoryChunkData
>>>The original naming seemed a bit awkward.
>>
>> That's too easy to confuse with regular MemoryContext and AllocChunk
>> stuff. I renamed it to HashMemoryChunk.
>
> BTW this breaks the second patch, which is allocating new chunks when
> resizing the hash table. Should I rebase the patch, or can the commiter
> do s/MemoryChunk/HashMemoryChunk/ ?
>
> Assuming there are no more fixes needed, of course.

Rebasing it will save the committer time, which will increase the
chances that one will look at your patch.  So it's highly recommended.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] inherit support for foreign tables

2014-09-10 Thread Heikki Linnakangas

I had a cursory look at this patch and the discussions around this.

ISTM there are actually two new features in this: 1) allow CHECK 
constraints on foreign tables, and 2) support inheritance for foreign 
tables. How about splitting it into two?
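
For clarity, a sketch of what the two pieces would let you write, assuming
the syntax simply mirrors regular tables (not taken from the patch):

CREATE TABLE measurement (ts timestamptz, val int);

CREATE FOREIGN TABLE measurement_2014 (
    CHECK (ts >= '2014-01-01' AND ts < '2015-01-01')  -- (1) CHECK constraint
)
INHERITS (measurement)                                -- (2) inheritance
SERVER remote_srv
OPTIONS (table_name 'measurement_2014');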


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-09-10 Thread Tomas Vondra
On 10.9.2014 20:25, Heikki Linnakangas wrote:
> On 09/10/2014 01:49 AM, Tomas Vondra wrote:
>> I also did a few 'minor' changes to the dense allocation patch, most
>> notably:
>>
>> * renamed HashChunk/HashChunkData to MemoryChunk/MemoryChunkData
>>The original naming seemed a bit awkward.
> 
> That's too easy to confuse with regular MemoryContext and AllocChunk
> stuff. I renamed it to HashMemoryChunk.

BTW this breaks the second patch, which is allocating new chunks when
resizing the hash table. Should I rebase the patch, or can the committer
do s/MemoryChunk/HashMemoryChunk/ ?

Assuming there are no more fixes needed, of course.

Tomas



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-09-10 Thread Tomas Vondra
On 10.9.2014 20:55, Heikki Linnakangas wrote:
> On 09/10/2014 09:31 PM, Robert Haas wrote:
 * the chunks size is 32kB (instead of 16kB), and we're using 1/4
 threshold for 'oversized' items

 We need the threshold to be >=8kB, to trigger the special case
 within AllocSet. The 1/4 rule is consistent with
 ALLOC_CHUNK_FRACTION.
>>>
>>> Should we care about the fact that if there are only a few tuples, we
>>> will
>>> nevertheless waste 32kB of memory for the chunk? I guess not, but I
>>> thought
>>> I'd mention it. The smallest allowed value for work_mem is 64kB.
>>
>> I think we should change the threshold here to 1/8th.  The worst case
>> memory wastage as-is ~32k/5 > 6k.
> 
> Not sure how you arrived at that number. The worst case is that each 32k
> chunk is filled up to 24k+1 bytes, i.e. 8k-1 bytes is wasted in each
> chunk. That's 25% wastage. That's not too bad IMHO - the worst case
> waste of AllocSet is 50%.
> 
> But if you want to twiddle it, feel free. Note that there's some
> interplay with allocset code, like Tomas mentioned. If the threshold is
> < 8k, palloc() will round up request sizes smaller than 8k anyway. That
> wastes some memory, nothing more serious than that, but it seems like a
> good idea to avoid it.

Yes, and pfree then keeps these blocks, which means a chance of keeping
memory that won't be reused. That's why I chose the 8kB threshold. So
I'd keep the 32kB/8kB, as proposed in the patch.

But I don't see this as a big issue - my experience is that we either
have tuples much smaller than 4kB, or much larger than 8kB. And even for
the tuples between 4kB and 8kB, the waste is not that bad.
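
For context, the allocation rule being discussed boils down to something
like the following standalone sketch (the 32kB/8kB constants are the ones
mentioned above; the names are invented and this is not the committed
code): tuples are packed densely into shared 32kB chunks, and anything
larger than a quarter of the chunk gets its own "oversized" chunk.

#include <stddef.h>
#include <stdlib.h>

#define CHUNK_SIZE      (32 * 1024)
#define OVERSIZE_LIMIT  (CHUNK_SIZE / 4)        /* the 8kB threshold */

typedef struct Chunk
{
    struct Chunk *next;         /* all chunks are kept on one list */
    size_t      used;
    size_t      maxlen;
    char        data[];         /* tuples are packed densely in here */
} Chunk;

static Chunk *chunk_list = NULL;
static Chunk *current = NULL;   /* the 32kB chunk we are filling */

static Chunk *
make_chunk(size_t datalen)
{
    Chunk      *c = malloc(offsetof(Chunk, data) + datalen);

    /* error handling omitted in this sketch */
    c->next = chunk_list;
    c->used = 0;
    c->maxlen = datalen;
    chunk_list = c;
    return c;
}

static void *
dense_alloc(size_t size)
{
    void       *ptr;

    /* oversized tuples get a dedicated chunk, so at most ~25% of a
     * shared chunk can end up wasted */
    if (size > OVERSIZE_LIMIT)
    {
        Chunk      *c = make_chunk(size);

        c->used = size;
        return c->data;
    }

    /* open a fresh 32kB chunk when the current one cannot fit the tuple */
    if (current == NULL || current->maxlen - current->used < size)
        current = make_chunk(CHUNK_SIZE);

    ptr = current->data + current->used;
    current->used += size;
    return ptr;
}

int
main(void)
{
    void       *a = dense_alloc(100);       /* packed into the shared chunk */
    void       *b = dense_alloc(100);       /* packed right after it */
    void       *c = dense_alloc(10000);     /* > 8kB: gets its own chunk */

    return (a != NULL && b != NULL && c != NULL) ? 0 : 1;
}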

regards
Tomas


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-09-10 Thread Robert Haas
On Wed, Sep 10, 2014 at 3:02 PM, Tomas Vondra  wrote:
> On 10.9.2014 20:31, Robert Haas wrote:
>> On Wed, Sep 10, 2014 at 2:25 PM, Heikki Linnakangas
>>  wrote:
>>> The dense-alloc-v5.patch looks good to me. I have committed that with minor
>>> cleanup (more comments below). I have not looked at the second patch.
>>
>> Gah.  I was in the middle of doing this.  Sigh.
>>
 * the chunks size is 32kB (instead of 16kB), and we're using 1/4
threshold for 'oversized' items

We need the threshold to be >=8kB, to trigger the special case
within AllocSet. The 1/4 rule is consistent with ALLOC_CHUNK_FRACTION.
>>>
>>> Should we care about the fact that if there are only a few tuples, we will
>>> nevertheless waste 32kB of memory for the chunk? I guess not, but I thought
>>> I'd mention it. The smallest allowed value for work_mem is 64kB.
>>
>> I think we should change the threshold here to 1/8th.  The worst case
>> memory wastage as-is ~32k/5 > 6k.
>
> So you'd lower the threshold to 4kB? That may lower the wastage in the
> chunks, but palloc will actually allocate 8kB anyway, wasting up to
> additional 4kB. So I don't see how lowering the threshold to 1/8th
> improves the situation ...

Ah, OK.  Well, never mind then.  :-)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-09-10 Thread Tomas Vondra
On 10.9.2014 20:31, Robert Haas wrote:
> On Wed, Sep 10, 2014 at 2:25 PM, Heikki Linnakangas
>  wrote:
>> The dense-alloc-v5.patch looks good to me. I have committed that with minor
>> cleanup (more comments below). I have not looked at the second patch.
> 
> Gah.  I was in the middle of doing this.  Sigh.
> 
>>> * the chunks size is 32kB (instead of 16kB), and we're using 1/4
>>>threshold for 'oversized' items
>>>
>>>We need the threshold to be >=8kB, to trigger the special case
>>>within AllocSet. The 1/4 rule is consistent with ALLOC_CHUNK_FRACTION.
>>
>> Should we care about the fact that if there are only a few tuples, we will
>> nevertheless waste 32kB of memory for the chunk? I guess not, but I thought
>> I'd mention it. The smallest allowed value for work_mem is 64kB.
> 
> I think we should change the threshold here to 1/8th.  The worst case
> memory wastage as-is ~32k/5 > 6k.

So you'd lower the threshold to 4kB? That may lower the wastage in the
chunks, but palloc will actually allocate 8kB anyway, wasting up to
additional 4kB. So I don't see how lowering the threshold to 1/8th
improves the situation ...

Tomas


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-09-10 Thread Tomas Vondra
On 10.9.2014 20:25, Heikki Linnakangas wrote:
> On 09/10/2014 01:49 AM, Tomas Vondra wrote:
>> I also did a few 'minor' changes to the dense allocation patch, most
>> notably:
>>
>> * renamed HashChunk/HashChunkData to MemoryChunk/MemoryChunkData
>>The original naming seemed a bit awkward.
> 
> That's too easy to confuse with regular MemoryContext and AllocChunk
> stuff. I renamed it to HashMemoryChunk.

OK.

> 
>> * the chunks size is 32kB (instead of 16kB), and we're using 1/4
>>threshold for 'oversized' items
>>
>>We need the threshold to be >=8kB, to trigger the special case
>>within AllocSet. The 1/4 rule is consistent with ALLOC_CHUNK_FRACTION.
> 
> Should we care about the fact that if there are only a few tuples, we
> will nevertheless waste 32kB of memory for the chunk? I guess not, but I
> thought I'd mention it. The smallest allowed value for work_mem is 64kB.

I don't think that's a problem.

>> I also considered Heikki's suggestion that while rebatching, we can
>> reuse chunks with a single large tuple, instead of copying it
>> needlessly. My attempt however made the code much uglier (I almost used
>> a GOTO, but my hands rejected to type it ...). I'll look into that.
> 
> I tried constructing a test case where the rehashing time would be 
> significant enough for this to matter, but I couldn't find one. Even
> if the original estimate on the number of batches is way too small,
> we ramp up quickly to a reasonable size and the rehashing time is
> completely insignificant compared to all the other work. So I think
> we can drop that idea.

I don't think that had anything to do with rehashing. What you pointed
out is that when rebatching, we have to keep ~50% of the tuples, which
means 'copy into a new chunk, then throw away the old ones'.

For large tuples (you mentioned 100MB tuples, IIRC), we needlessly
allocate this amount of memory only to copy the single tuple and then
throw away the old tuple. So (a) that's an additional memcpy, and (b) it
allocates additional 100MB of memory for a short period of time.

Now, I'd guess when dealing with tuples this large, there will be more
serious trouble elsewhere, but I don't want to neglect it. Maybe it's
worth optimizing?

Tomas


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-09-10 Thread Heikki Linnakangas

On 09/10/2014 09:31 PM, Robert Haas wrote:

On Wed, Sep 10, 2014 at 2:25 PM, Heikki Linnakangas
 wrote:

The dense-alloc-v5.patch looks good to me. I have committed that with minor
cleanup (more comments below). I have not looked at the second patch.


Gah.  I was in the middle of doing this.  Sigh.


Sorry.


* the chunks size is 32kB (instead of 16kB), and we're using 1/4
threshold for 'oversized' items

We need the threshold to be >=8kB, to trigger the special case
within AllocSet. The 1/4 rule is consistent with ALLOC_CHUNK_FRACTION.


Should we care about the fact that if there are only a few tuples, we will
nevertheless waste 32kB of memory for the chunk? I guess not, but I thought
I'd mention it. The smallest allowed value for work_mem is 64kB.


I think we should change the threshold here to 1/8th.  The worst case
memory wastage as-is ~32k/5 > 6k.


Not sure how you arrived at that number. The worst case is that each 32k 
chunk is filled up to 24k+1 bytes, i.e. 8k-1 bytes is wasted in each 
chunk. That's 25% wastage. That's not too bad IMHO - the worst case 
waste of AllocSet is 50%.


But if you want to twiddle it, feel free. Note that there's some 
interplay with allocset code, like Tomas mentioned. If the threshold is 
< 8k, palloc() will round up request sizes smaller than 8k anyway. That 
wastes some memory, nothing more serious than that, but it seems like a 
good idea to avoid it.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] removing volatile qualifiers from lwlock.c

2014-09-10 Thread Robert Haas
As discussed on the thread "Spinlocks and compiler/memory barriers",
now that we've made the spinlock primitives function as compiler
barriers (we think), it should be possible to remove volatile
qualifiers from many places in the source code.  The attached patch
does this in lwlock.c.  If the changes in commit
0709b7ee72e4bc71ad07b7120acd117265ab51d0 (and follow-on commits) are
correct and complete, applying this shouldn't break anything, while
possibly giving the compiler room to optimize things better than it
does today.

However, demonstrating the necessity of that commit for these changes
seems to be non-trivial.  I tried applying this patch and reverting
commits 5b26278822c69dd76ef89fd50ecc7cdba9c3f035,
b4c28d1b92c81941e4fc124884e51a7c110316bf, and
0709b7ee72e4bc71ad07b7120acd117265ab51d0 on a PPC64 POWER8 box with a
whopping 192 hardware threads (thanks, IBM!).  I then ran the
regression tests repeatedly, and I ran several long pgbench runs with
as many as 350 concurrent clients.  No failures.

So I'm posting this patch in the hope that others can help.  The
relevant tests are:

1. If you apply this patch to master and run tests of whatever kind
strikes your fancy, does anything break under high concurrency?  If it
does, then the above commits weren't enough to make this safe on your
platform.

2. If you apply this patch to master, revert the commits mentioned
above, and again run tests, does anything now break?  If it does (but
the first tests were OK), then that shows that those commits did
something useful on your platform.

The change to make spinlocks act as compiler barriers provides the
foundation for, hopefully, a cleaner code base and easier future work
on scalability and performance projects, so help is much appreciated.
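
To illustrate the property being relied on, here is a standalone analogy
(it uses a pthread mutex in place of a PostgreSQL spinlock and is not
backend code): because the lock and unlock calls act as compiler/memory
barriers, the protected field can be a plain, non-volatile int - the
compiler must reload it after the lock and write it back before the
unlock.

#include <pthread.h>
#include <stdio.h>

typedef struct
{
    pthread_mutex_t mutex;      /* stands in for the spinlock */
    int             shared;     /* state protected by the lock */
} Counter;

static Counter counter = { PTHREAD_MUTEX_INITIALIZER, 0 };

static void *
worker(void *arg)
{
    int         i;

    for (i = 0; i < 100000; i++)
    {
        /* lock/unlock are barriers, so a plain access is safe here */
        pthread_mutex_lock(&counter.mutex);
        counter.shared++;
        pthread_mutex_unlock(&counter.mutex);
    }
    return NULL;
}

int
main(void)
{
    pthread_t   t1, t2;

    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("shared = %d\n", counter.shared);    /* expect 200000 */
    return 0;
}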

Thanks,

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 7c96da5..66fb2e4 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -112,7 +112,7 @@ static lwlock_stats lwlock_stats_dummy;
 bool		Trace_lwlocks = false;
 
 inline static void
-PRINT_LWDEBUG(const char *where, const volatile LWLock *lock)
+PRINT_LWDEBUG(const char *where, const LWLock *lock)
 {
 	if (Trace_lwlocks)
 		elog(LOG, "%s(%s %d): excl %d shared %d head %p rOK %d",
@@ -406,9 +406,7 @@ LWLock *
 LWLockAssign(void)
 {
 	LWLock	   *result;
-
-	/* use volatile pointer to prevent code rearrangement */
-	volatile int *LWLockCounter;
+	int		   *LWLockCounter;
 
 	LWLockCounter = (int *) ((char *) MainLWLockArray - 3 * sizeof(int));
 	SpinLockAcquire(ShmemLock);
@@ -429,9 +427,7 @@ int
 LWLockNewTrancheId(void)
 {
 	int			result;
-
-	/* use volatile pointer to prevent code rearrangement */
-	volatile int *LWLockCounter;
+	int		   *LWLockCounter;
 
 	LWLockCounter = (int *) ((char *) MainLWLockArray - 3 * sizeof(int));
 	SpinLockAcquire(ShmemLock);
@@ -511,9 +507,8 @@ LWLockAcquireWithVar(LWLock *l, uint64 *valptr, uint64 val)
 
 /* internal function to implement LWLockAcquire and LWLockAcquireWithVar */
 static inline bool
-LWLockAcquireCommon(LWLock *l, LWLockMode mode, uint64 *valptr, uint64 val)
+LWLockAcquireCommon(LWLock *lock, LWLockMode mode, uint64 *valptr, uint64 val)
 {
-	volatile LWLock *lock = l;
 	PGPROC	   *proc = MyProc;
 	bool		retry = false;
 	bool		result = true;
@@ -525,7 +520,7 @@ LWLockAcquireCommon(LWLock *l, LWLockMode mode, uint64 *valptr, uint64 val)
 	PRINT_LWDEBUG("LWLockAcquire", lock);
 
 #ifdef LWLOCK_STATS
-	lwstats = get_lwlock_stats_entry(l);
+	lwstats = get_lwlock_stats_entry(lock);
 
 	/* Count lock acquisition attempts */
 	if (mode == LW_EXCLUSIVE)
@@ -642,13 +637,13 @@ LWLockAcquireCommon(LWLock *l, LWLockMode mode, uint64 *valptr, uint64 val)
 		 * so that the lock manager or signal manager will see the received
 		 * signal when it next waits.
 		 */
-		LOG_LWDEBUG("LWLockAcquire", T_NAME(l), T_ID(l), "waiting");
+		LOG_LWDEBUG("LWLockAcquire", T_NAME(lock), T_ID(lock), "waiting");
 
 #ifdef LWLOCK_STATS
 		lwstats->block_count++;
 #endif
 
-		TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(l), T_ID(l), mode);
+		TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(lock), T_ID(lock), mode);
 
 		for (;;)
 		{
@@ -659,9 +654,9 @@ LWLockAcquireCommon(LWLock *l, LWLockMode mode, uint64 *valptr, uint64 val)
 			extraWaits++;
 		}
 
-		TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(l), T_ID(l), mode);
+		TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(lock), T_ID(lock), mode);
 
-		LOG_LWDEBUG("LWLockAcquire", T_NAME(l), T_ID(l), "awakened");
+		LOG_LWDEBUG("LWLockAcquire", T_NAME(lock), T_ID(lock), "awakened");
 
 		/* Now loop back and try to acquire lock again. */
 		retry = true;
@@ -675,10 +670,10 @@ LWLockAcquireCommon(LWLock *l, LWLockMode mode, uint64 *valptr, uint64 val)
 	/* We are done updating shared state of the lock itself. */
 	SpinLockRelease(&lock->mutex);
 
-	TRACE_POSTGRESQL_LWLOCK_ACQUIRE(T_NAME(l), T_ID(l), mode);
+	TRAC

Re: [HACKERS] pgbench throttling latency limit

2014-09-10 Thread Jan Wieck

On 09/10/2014 11:28 AM, Heikki Linnakangas wrote:

On 09/10/2014 05:57 PM, Fabien COELHO wrote:


Hello Heikki,


I looked closer at the this, and per Jan's comments, realized that we don't
log the lag time in the per-transaction log file. I think that's a serious
omission; when --rate is used, the schedule lag time is important information
to make sense of the result. I think we have to apply the attached patch, and
backpatch to 9.4. (The documentation on the log file format needs updating)


Indeed. I think that people do not like it to change. I remember that I
suggested to change timestamps to ".yy" instead of the unreadable
" yyy", and be told not to, because some people have tool which
process the output so the format MUST NOT CHANGE. So my behavior is not to
avoid touching anything in this area.

I'm fine if you do it, though:-) However I have not time to have a precise
look at your patch to cross-check it before Friday.


This is different from changing "xxx yyy" to "xxx.yyy" in two ways.
First, this adds new information to the log that just isn't there
otherwise, it's not just changing the format for the sake of it. Second,
in this case it's easy to write a parser for the log format so that it
works with the old and new formats. Many such programs would probably
ignore any unexpected extra fields, as a matter of lazy programming,
while changing the separator from space to a dot would surely require
changing every parsing program.

We could leave out the lag fields, though, when --rate is not used.
Without --rate, the lag is always zero anyway. That would keep the file
format unchanged, when you don't use the new --rate feature. I'm not
sure if that would be better or worse for programs that might want to
parse the information.


We could also leave the default output format as is and introduce 
another option with a % style format string.



Jan





Also, this is bizarre:


int64 wait = (int64) (throttle_delay *
   1.00055271703 * -log(getrand(thread, 1, 1) / 1.0));


We're using getrand() to generate a uniformly distributed random value
between 1 and 1, and then convert it to a double between 0.0 and 1.0.


The reason for this conversion is to have randomness but to still avoid
going to extreme multiplier values. The idea is to avoid a very large
multiplier which would generate (even if it is not often) a 0 tps when 100
tps is required. The 1 granularity is basically random but the
multiplier stays bounded (max 9.2, so the minimum possible tps would be
around 11 for a target of 100 tps, bar issues from the database for
processing the transactions).

So although this feature can be discussed and amended, it is fully
voluntary. I think that it make sense so I would prefer to keep it as is.
Maybe the comments could be update to be clearer.


Ok, yeah, the comments indeed didn't mention anything about that. I
don't think such clamping is necessary. With that 9.2x clamp on the
multiplier, the probability that any given transaction hits it is about
1/1. And a delay 9.2 times the average is still quite reasonable,
IMHO. The maximum delay on my laptop, when pg_erand48() returns DBL_MIN,
seems to be about 700 x the average, which is still reasonable if you
run a decent number of transactions. And of course, the probability of
hitting such an extreme value is miniscule, so if you're just doing a
few quick test runs with a small total number of transactions, you won't
hit that.

- Heikki




--
Jan Wieck
Senior Software Engineer
http://slony.info


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-09-10 Thread Robert Haas
On Wed, Sep 10, 2014 at 2:25 PM, Heikki Linnakangas
 wrote:
> The dense-alloc-v5.patch looks good to me. I have committed that with minor
> cleanup (more comments below). I have not looked at the second patch.

Gah.  I was in the middle of doing this.  Sigh.

>> * the chunks size is 32kB (instead of 16kB), and we're using 1/4
>>threshold for 'oversized' items
>>
>>We need the threshold to be >=8kB, to trigger the special case
>>within AllocSet. The 1/4 rule is consistent with ALLOC_CHUNK_FRACTION.
>
> Should we care about the fact that if there are only a few tuples, we will
> nevertheless waste 32kB of memory for the chunk? I guess not, but I thought
> I'd mention it. The smallest allowed value for work_mem is 64kB.

I think we should change the threshold here to 1/8th.  The worst case
memory wastage as-is ~32k/5 > 6k.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-10 Thread Robert Haas
On Wed, Sep 10, 2014 at 1:36 PM, Peter Geoghegan  wrote:
>> In order to know how much we're
>> giving up in that case, we need the exact number I asked you to
>> provide in my previous email: the ratio of the cost of strcoll() to
>> the cost of memcmp().
>>
>> I see that you haven't chosen to provide that information in any of
>> your four responses.
>
> Well, it's kind of difficult to give that number in a vacuum.

No, not really.  All you have to do is write a little test program to
gather the information.
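
Something like the following standalone sketch is presumably what is
meant (the string length, iteration count and locale choice are
arbitrary; it only gives a rough ratio on one machine). Build with
something like: cc -O2 coll_bench.c -o coll_bench

#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define LEN   64
#define ITERS 5000000L

static double
elapsed(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int
main(void)
{
    /* volatile function pointers keep the compiler from hoisting the
     * calls out of the loops */
    int         (*volatile cmp) (const void *, const void *, size_t) = memcmp;
    int         (*volatile coll) (const char *, const char *) = strcoll;
    char        s1[LEN + 1], s2[LEN + 1];
    struct timespec t0, t1, t2;
    long        i;
    int         sink = 0;

    setlocale(LC_COLLATE, "");  /* use the environment's collation */
    memset(s1, 'a' + (int) (getpid() % 26), LEN);   /* not known at compile time */
    s1[LEN] = '\0';
    memcpy(s2, s1, LEN + 1);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < ITERS; i++)
        sink += cmp(s1, s2, LEN);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    for (i = 0; i < ITERS; i++)
        sink += coll(s1, s2);
    clock_gettime(CLOCK_MONOTONIC, &t2);

    printf("memcmp:  %.3f s\n", elapsed(t0, t1));
    printf("strcoll: %.3f s\n", elapsed(t1, t2));
    printf("ratio:   %.1fx\n", elapsed(t1, t2) / elapsed(t0, t1));
    return sink & 1;
}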

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-09-10 Thread Heikki Linnakangas

On 09/10/2014 01:49 AM, Tomas Vondra wrote:

On 9.9.2014 16:09, Robert Haas wrote:

On Mon, Sep 8, 2014 at 5:53 PM, Tomas Vondra  wrote:

So I only posted the separate patch for those who want to do a review,
and then a "complete patch" with both parts combined. But it sure may be
a bit confusing.


Let's do this: post each new version of the patches only on this
thread, as a series of patches that can be applied one after another.


OK, attached. Apply in this order

1) dense-alloc-v5.patch
2) hashjoin-nbuckets-v12.patch


The dense-alloc-v5.patch looks good to me. I have committed that with 
minor cleanup (more comments below). I have not looked at the second patch.



I also did a few 'minor' changes to the dense allocation patch, most
notably:

* renamed HashChunk/HashChunkData to MemoryChunk/MemoryChunkData
   The original naming seemed a bit awkward.


That's too easy to confuse with regular MemoryContext and AllocChunk 
stuff. I renamed it to HashMemoryChunk.



* the chunks size is 32kB (instead of 16kB), and we're using 1/4
   threshold for 'oversized' items

   We need the threshold to be >=8kB, to trigger the special case
   within AllocSet. The 1/4 rule is consistent with ALLOC_CHUNK_FRACTION.


Should we care about the fact that if there are only a few tuples, we 
will nevertheless waste 32kB of memory for the chunk? I guess not, but I 
thought I'd mention it. The smallest allowed value for work_mem is 64kB.



I also considered Heikki's suggestion that while rebatching, we can
reuse chunks with a single large tuple, instead of copying it
needlessly. My attempt however made the code much uglier (I almost used
a GOTO, but my hands rejected to type it ...). I'll look into that.


I tried constructing a test case where the rehashing time would be 
significant enough for this to matter, but I couldn't find one. Even if 
the original estimate on the number of batches is way too small, we ramp 
up quickly to a reasonable size and the rehashing time is completely 
insignificant compared to all the other work. So I think we can drop 
that idea.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [Fwd: Re: [HACKERS] proposal: new long psql parameter --on-error-stop]

2014-09-10 Thread Pavel Stehule
2014-09-10 0:13 GMT+02:00 Andres Freund :

> On 2014-09-09 22:22:45 +0200, Andres Freund wrote:
> > I plan to push this soon.
>
> Done.
>
> Thanks for the patch!
>

Thank you very much

Pavel


>
> Andres Freund
>
> --
>  Andres Freund http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services
>


Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-10 Thread Peter Geoghegan
On Tue, Sep 9, 2014 at 2:00 PM, Robert Haas  wrote:
> Boiled down, what you're saying is that you might have a set that
> contains lots of duplicates in general, but not very many where the
> abbreviated-keys also match.  Sure, that's true.

Abbreviated keys are not used in the case where we do a (fully)
opportunistic memcmp(), without really having any idea of whether or
not it'll work out. Abbreviated keys aren't really relevant to that
case, except perhaps in that we know we'll have statistics available
for leading attributes, which will make the case less frequent in
practice.

> But you might also
> not have that case, so I don't see where that gets us; the same
> worst-case test case Heikki developed the last time we relitigated
> this point is still relevant here.

Well, I think that there should definitely be a distinction made
between abbreviated and non-abbreviated cases; you could frequently
have almost 100% certainty that each of those optimistic memcmp()s
will work out with abbreviated keys. Low cardinality sets are very
common. I'm not sure what your position on that is. My proposal to
treat both of those cases (abbreviated with a cost model/cardinality
statistics; non-abbreviated without) the same is based on different
arguments for each case.

> In order to know how much we're
> giving up in that case, we need the exact number I asked you to
> provide in my previous email: the ratio of the cost of strcoll() to
> the cost of memcmp().
>
> I see that you haven't chosen to provide that information in any of
> your four responses.

Well, it's kind of difficult to give that number in a vacuum. I showed
a case that had a large majority of opportunistic memcmp()s go to
waste, while a small number were useful, which still put us ahead. I
can look at Heikki's case with this again if you think that'll help.
Heikki said that his case was all about wasted strxfrm()s, which are
surely much more expensive than wasted memcmp()s, particularly when
you consider temporal locality (we needed that memory to be stored in
a cacheline for the immediately subsequent operation anyway, should
the memcmp() thing not work out - the simple ratio that you're
interested in may be elusive).

In case I haven't been clear enough on this point, I re-emphasize that
I do accept that for something like the non-abbreviated case, the
opportunistic memcmp() thing must virtually be free if we're to
proceed, since it is purely opportunistic. If it can be demonstrated
that that isn't the case (and if that cannot be fixed by limiting it
to < CACHE_LINE_SIZE), clearly we should not proceed with
opportunistic (non-abbreviated) memcmp()s. In fact, I think I'm
holding it to a higher standard than you are - I believe that it had
better be virtually free.
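
For concreteness, the comparator shortcut being debated looks roughly
like this (a sketch with an invented name, not the proposed patch): a
cheap bytewise equality check is tried first, and the locale-aware
strcoll() is paid for only when the strings are not bytewise identical.

#include <string.h>

static int
varstr_cmp_sketch(const char *a, size_t alen, const char *b, size_t blen)
{
    /* opportunistic check: bytewise-equal strings compare equal under
     * any collation, so strcoll() can be skipped entirely */
    if (alen == blen && memcmp(a, b, alen) == 0)
        return 0;

    /* fall back to the expensive locale-aware comparison; strcoll()
     * wants NUL-terminated strings, so copy into temporary buffers
     * (C99 VLAs are fine for a sketch; the real code is more careful) */
    {
        char        abuf[alen + 1];
        char        bbuf[blen + 1];

        memcpy(abuf, a, alen);
        abuf[alen] = '\0';
        memcpy(bbuf, b, blen);
        bbuf[blen] = '\0';
        return strcoll(abuf, bbuf);
    }
}

When the memcmp() "fails", the only cost is one cheap pass over bytes
that were about to be read anyway for the strcoll() call.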

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Final Patch for GROUPING SETS - unrecognized node type: 347

2014-09-10 Thread Andrew Gierth
> "Tomas" == Tomas Vondra  writes:

 Tomas> If we can get rid of the excessive ChainAggregate, that's
 Tomas> certainly enough for now.

I found an algorithm that should provably give the minimal number of sorts
(I was afraid that problem would turn out to be NP-hard, but not so - it's
solvable in P by reducing it to a problem of maximal matching in bipartite
graphs).

Updated patch should be forthcoming in a day or two.

-- 
Andrew (irc:RhodiumToad)


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Escaping from blocked send() reprised.

2014-09-10 Thread Heikki Linnakangas

Wrong thread...

On 09/10/2014 03:04 AM, Kyotaro HORIGUCHI wrote:

Hmm. Sorry, I misunderstood the specification.


Your approach of coloring tokens seems right, but you have
broken the parse logic with the code you added.

Other than the mistakes others pointed out, I found that

- for non-SQL-ident-like tokens, the token style, quoted or not, is
   ignored when comparing, so the following line works.

| "local" All aLL trust

I suppose this is not what you intended. This is because you have
ignored the attribute of a token when comparing it as a
non-SQL-ident token.


- '+' at the head of the sequence '+"' is treated as the first
   character of the *quoted* string. e.g. +"hoge" is tokenized as
   "+hoge":special_quoted.


I found this is what was intended. This should be documented in
comments.

|2) users and user-groups only requires special handling and behavior as follows
|Normal user :
|  A. unquoted ( USER ) will be treated as user ( downcase ).
|  B. quoted  ( "USeR" )  will be treated as USeR (case-sensitive).
|  C. quoted ( "+USER" ) will be treated as normal user +USER (i.e. will 
not be considered as user-group) and case-sensitive as string is quoted.

This seems confusing together with B below. It seems this should be
rearranged.

|   User Group :
|  A. unquoted ( +USERGROUP ) will be treated as +usergruop ( downcase ).
|  B. plus quoted ( +"UserGROUP"  ) will be treated as +UserGROUP 
(case-sensitive).




This is why you simply continued processing for '+"' without
discarding and skipping the '+', and without setting in_quote, so the
following parser code does not work as intended. You should
understand what the original code does and insert or modify logic
without breaking its assumptions.


regards,




--
- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Memory Alignment in Postgres

2014-09-10 Thread Robert Haas
On Tue, Sep 9, 2014 at 10:08 AM, Arthur Silva  wrote:
> I'm continuously studying Postgres codebase. Hopefully I'll be able to make
> some contributions in the future.
>
> For now I'm intrigued about the extensive use of memory alignment. I'm sure
> there's some legacy and some architecture that requires it reasoning behind
> it.
>
> That aside, since it wastes space (a lot of space in some cases) there must
> be a tipping point somewhere. I'm sure one can prove aligned access is
> faster in a micro-benchmark but I'm not sure it's the case in a DBMS like
> postgres, specially in the page/rows area.
>
> Just for the sake of comparison Mysql COMPACT storage (default and
> recommended since 5.5) doesn't align data at all. Mysql NDB uses a fixed
> 4-byte alignment. Not sure about Oracle and others.
>
> Is it worth the extra space in newer architectures (specially Intel)?
> Do you guys think this is something worth looking at?

Yes.  At least in my opinion, though, it's not a good project for a
beginner.  If you get your changes to take effect, you'll find that a
lot of things will break in places that are not easy to find or fix.
You're getting into really low-level areas of the system that get
touched infrequently and require a lot of expertise in how things work
today to adjust.

The idea I've had before is to try to reduce the widest alignment we
ever require from 8 bytes to 4 bytes.  That is, look for types with
typalign = 'd', and rewrite them to have typalign = 'i' by having them
use two 4-byte loads to load an eight-byte value.  In practice, I
think this would probably save a high percentage of what can be saved,
because 8-byte alignment implies a maximum of 7 bytes of wasted space,
while 4-byte alignment implies a maximum of 3 bytes of wasted space.
And it would probably be pretty cheap, too, because any type with less
than 8 byte alignment wouldn't be affected at all, and even those
types that were affected would only be slightly slowed down by doing
two loads instead of one.  In contrast, getting rid of alignment
requirements completely would save a little more space, but probably
at the cost of a lot more slowdown: any type with alignment
requirements would have to fetch the value byte-by-byte instead of
pulling the whole thing out at once.
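
As a toy illustration of the two-4-byte-loads idea (little-endian byte
order is assumed purely to keep it short; this is not backend code), an
8-byte value stored at an address that is only guaranteed to be 4-byte
aligned can be fetched like this:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

static uint64_t
fetch_int8_4byte_aligned(const void *ptr)
{
    const uint32_t *p = (const uint32_t *) ptr;
    uint64_t    lo = p[0];
    uint64_t    hi = p[1];

    return lo | (hi << 32);     /* little-endian; swap for big-endian */
}

int
main(void)
{
    /* the 8-byte value starts at offset 4, i.e. it is 4-byte aligned
     * but not necessarily 8-byte aligned */
    uint32_t    buf[3];
    uint64_t    val = UINT64_C(0x1122334455667788);

    memcpy(&buf[1], &val, sizeof(val));
    printf("%llx\n", (unsigned long long) fetch_int8_4byte_aligned(&buf[1]));
    return 0;
}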

But there are a couple of obvious problems with this idea, too, such as:

1. It's really complicated and a ton of work.
2. It would break pg_upgrade pretty darn badly unless we employed some
even-more-complex strategy to mitigate that.
3. The savings might not be enough to justify the effort.

It might be interesting for someone to develop a tool measuring the
number of bytes of alignment padding we lose per tuple or per page and
gather some statistics on it on various databases.  That would give us
some sense as to the possible savings.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgcrypto: PGP armor headers

2014-09-10 Thread Heikki Linnakangas

On 09/10/2014 04:35 PM, Marko Tiikkaja wrote:

Speaking of good looks, should I add it to the next commitfest as a new
patch, or should we try and get someone to review it like this?


Let's handle this in this commitfest.

- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench throttling latency limit

2014-09-10 Thread Heikki Linnakangas

On 09/10/2014 05:57 PM, Fabien COELHO wrote:


Hello Heikki,


I looked closer at the this, and per Jan's comments, realized that we don't
log the lag time in the per-transaction log file. I think that's a serious
omission; when --rate is used, the schedule lag time is important information
to make sense of the result. I think we have to apply the attached patch, and
backpatch to 9.4. (The documentation on the log file format needs updating)


Indeed. I think that people do not like it to change. I remember that I
suggested to change timestamps to ".yy" instead of the unreadable
" yyy", and be told not to, because some people have tool which
process the output so the format MUST NOT CHANGE. So my behavior is not to
avoid touching anything in this area.

I'm fine if you do it, though:-) However I have not time to have a precise
look at your patch to cross-check it before Friday.


This is different from changing "xxx yyy" to "xxx.yyy" in two ways. 
First, this adds new information to the log that just isn't there 
otherwise, it's not just changing the format for the sake of it. Second, 
in this case it's easy to write a parser for the log format so that it 
works with the old and new formats. Many such programs would probably 
ignore any unexpected extra fields, as a matter of lazy programming, 
while changing the separator from space to a dot would surely require 
changing every parsing program.


We could leave out the lag fields, though, when --rate is not used. 
Without --rate, the lag is always zero anyway. That would keep the file 
format unchanged, when you don't use the new --rate feature. I'm not 
sure if that would be better or worse for programs that might want to 
parse the information.



Also, this is bizarre:


int64 wait = (int64) (throttle_delay *
   1.00055271703 * -log(getrand(thread, 1, 1) / 1.0));


We're using getrand() to generate a uniformly distributed random value
between 1 and 1, and then convert it to a double between 0.0 and 1.0.


The reason for this conversion is to have randomness but to still avoid
going to extreme multiplier values. The idea is to avoid a very large
multiplier which would generate (even if it is not often) a 0 tps when 100
tps is required. The 1 granularity is basically random but the
multiplier stays bounded (max 9.2, so the minimum possible tps would be
around 11 for a target of 100 tps, bar issues from the database for
processing the transactions).

So although this feature can be discussed and amended, it is fully
voluntary. I think that it make sense so I would prefer to keep it as is.
Maybe the comments could be update to be clearer.


Ok, yeah, the comments indeed didn't mention anything about that. I 
don't think such clamping is necessary. With that 9.2x clamp on the 
multiplier, the probability that any given transaction hits it is about 
1/1. And a delay 9.2 times the average is still quite reasonable, 
IMHO. The maximum delay on my laptop, when pg_erand48() returns DBL_MIN, 
seems to be about 700 x the average, which is still reasonable if you 
run a decent number of transactions. And of course, the probability of 
hitting such an extreme value is minuscule, so if you're just doing a 
few quick test runs with a small total number of transactions, you won't 
hit that.
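
(For reference, the arithmetic behind that: the multiplier is -ln(U)
with U uniform on (0, 1], so P(-ln(U) > 9.2) = P(U < exp(-9.2)) =
exp(-9.2), which is about 1.0e-4 - a delay beyond 9.2 times the average
would come up in roughly one transaction in ten thousand.)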


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] add modulo (%) operator to pgbench

2014-09-10 Thread Robert Haas
On Wed, Sep 10, 2014 at 4:48 AM, Fabien COELHO  wrote:
> Note I did not start with the non uniform stuff, but Mitsumasa-san sent a
> gaussian distribution patch and I jumped onto the wagon to complement it
> with an exponential distribution patch. I knew when doing it that is was not
> enough, but as I said "one piece at a time", given the effort required to
> pass simple patch.
>
> What is still needed for the overall purpose is the ability to scatter the
> distribution. This is really:
>
>  (1) a positive remainder modulo, which is the trivial patch under
>  discussion
>
>  (2) a fast but good quality for the purpose hash function
>  also a rather small patch, not submitted yet.
>
>  (3) maybe the '|' operator to do a TP*-like non-uniform load,
>  which is really periodic so I do not like it.
>  a trivial patch, not submitted yet.
>
> If you do not want one of these pieces (1 & 2), basically the interest of
> gaussian/exponential addition is much reduced, and this is just a half baked
> effort aborted because you did not want what was required to make it useful.
> Well, I can only disagree, but you are a committer and I'm not!

I am not objecting to the functionality; I'm objecting to bolting on
ad-hoc operators one at a time.  I think an expression syntax would
let us do this in a much more scalable way.  If I had time, I'd go do
that, but I don't.  We could add abs(x) and hash(x) and it would all
be grand.
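
For concreteness, the two small pieces from the quoted list above could
look something like this standalone sketch (invented names; the hash
shown is the splitmix64 finalizer, picked here only as an example of a
cheap integer hash, not what any patch actually uses):

#include <stdint.h>
#include <stdio.h>

/* positive remainder: result is in [0, b) for b > 0, even for a < 0 */
static int64_t
pos_mod(int64_t a, int64_t b)
{
    int64_t     r = a % b;

    return (r < 0) ? r + b : r;
}

/* splitmix64 finalizer: fast, decent avalanche, enough to de-correlate
 * adjacent key values and scatter a non-uniform distribution */
static uint64_t
hash64(uint64_t x)
{
    x ^= x >> 30;
    x *= UINT64_C(0xbf58476d1ce4e5b9);
    x ^= x >> 27;
    x *= UINT64_C(0x94d049bb133111eb);
    x ^= x >> 31;
    return x;
}

int
main(void)
{
    printf("%lld\n", (long long) pos_mod(-7, 5));       /* 3, not -2 */
    printf("%llx\n", (unsigned long long) hash64(12345));
    return 0;
}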

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Aussie timezone database changes incoming

2014-09-10 Thread Tom Lane
In connection with a question asked today on pgsql-general, I had
occasion to go check the release announcements for the IANA timezone
database files, and it turns out that there are some big changes in
2014f:
http://mm.icann.org/pipermail/tz-announce/2014-August/23.html

The Russian changes are perhaps not such a big deal because they've
done that sort of thing before, but this is an earful:

 Australian eastern time zone abbreviations are now AEST/AEDT not
 EST, and similarly for the other Australian zones.  That is, for
 eastern standard and daylight saving time the abbreviations are AEST
 and AEDT instead of the former EST for both; similarly, ACST/ACDT,
 ACWST/ACWDT, and AWST/AWDT are now used instead of the former CST,
 CWST, and WST.  This change does not affect UTC offsets, only time
 zone abbreviations.  (Thanks to Rich Tibbett and many others.)

I'm wondering how many Aussie applications are going to break when
this goes in, and if we could/should do anything about it.  One idea
that comes to mind is to create an "Australia_old" tznames file
containing the current Aussie zone abbreviations, so as to provide
an easy way to maintain backwards compatibility at need (you'd select
that as your timezone_abbreviations GUC setting).

Anyone from down under care to remark about the actual usage of old
and new abbreviations?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Spinlocks and compiler/memory barriers

2014-09-10 Thread Robert Haas
On Tue, Sep 9, 2014 at 6:00 PM, Andres Freund  wrote:
> On 2014-09-09 17:54:03 -0400, Robert Haas wrote:
>> So, that's committed, then.
>
> Yay, finally.
>
>> I think we should pick something that uses
>> spinlocks and is likely to fail spectacularly if we haven't got this
>> totally right yet, and de-volatilize it.  And then watch to see what
>> turns red in the buildfarm and/or which users start screaming.
>
> Good plan.
>
>> I'm inclined to propose lwlock.c as a candidate, since that's very
>> widely used and a place where we know there's significant contention.
>
> I suggest, additionally possibly, GetSnapshotData(). It's surely one of
> the hottest functions in postgres, and I've seen some performance
> increases from de-volatilizing it. IIRC it requires one volatile cast in
> one place to enforce that a variable is accessed only once. Not sure if
> we want to add such volatile casts or use something like linux'
> ACCESS_ONCE macros for some common types. Helps to grep for places
> worthy of inspection.

GetSnapshotData() isn't quite to the point, because it's using a
volatile pointer, but not because of anything about spinlock
manipulation.  That would, perhaps, be a good test for barriers, but
not for this.
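
For reference, the ACCESS_ONCE-style macro mentioned above is, in the
Linux kernel's formulation, roughly the following (it relies on the
GCC/Clang typeof extension and is shown only as an illustration): it
forces exactly one load or store of the lvalue at that point, without
making the whole object volatile.

#define ACCESS_ONCE(x)  (*(volatile __typeof__(x) *) &(x))

/* usage sketch: re-read a shared flag from memory on every iteration */
static int shared_flag;

static void
wait_for_flag(void)
{
    while (ACCESS_ONCE(shared_flag) == 0)
        ;   /* spin; the load cannot be hoisted out of the loop */
}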

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench throttling latency limit

2014-09-10 Thread Fabien COELHO


Hello Heikki,

I looked closer at the this, and per Jan's comments, realized that we don't 
log the lag time in the per-transaction log file. I think that's a serious 
omission; when --rate is used, the schedule lag time is important information 
to make sense of the result. I think we have to apply the attached patch, and 
backpatch to 9.4. (The documentation on the log file format needs updating)


Indeed. I think that people do not like it to change. I remember that I 
suggested changing timestamps to ".yy" instead of the unreadable 
" yyy", and was told not to, because some people have tools which 
process the output, so the format MUST NOT CHANGE. So now my behavior is 
to avoid touching anything in this area.


I'm fine if you do it, though :-) However, I do not have time to take a 
precise look at your patch to cross-check it before Friday.



Also, this is bizarre:


int64 wait = (int64) (throttle_delay *
  1.00055271703 * -log(getrand(thread, 1, 1) / 1.0));


We're using getrand() to generate a uniformly distributed random value 
between 1 and 1, and then convert it to a double between 0.0 and 1.0.


The reason for this conversion is to have randomness but to still avoid 
going to extreme multiplier values. The idea is to avoid a very large 
multiplier which would generate (even if it is not often) a 0 tps when 100 
tps is required. The 1 granularity is basically random but the 
multiplier stays bounded (max 9.2, so the minimum possible tps would be 
around 11 for a target of 100 tps, bar issues from the database for 
processing the transactions).


So although this feature can be discussed and amended, it is fully 
intentional. I think that it makes sense, so I would prefer to keep it 
as is. Maybe the comments could be updated to be clearer.


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench throttling latency limit

2014-09-10 Thread Mitsumasa KONDO
Hi,

I find typo in your patch. Please confirm.

@line 239
- agg->sum2_lag = 0;
+  agg->sum_lag = 0;

And back patch is welcome for me.

Best Regards,
--
Mitsumasa KONDO


Re: [HACKERS] pgbench throttling latency limit

2014-09-10 Thread Heikki Linnakangas

On 09/09/2014 03:35 PM, Fabien COELHO wrote:


Hello Heikki,


I think we have to reconsider what we're reporting in 9.4, when --rate
is enabled, even though it's already very late in the release cycle.
It's a bad idea to change the definition of latency between 9.4 and 9.5,
so let's get it right in 9.4.


Indeed.


As per the attached patch. I think we should commit this to 9.4. Any
objections?


Ok for me. Some more propositions about the doc below.


I looked closer at this, and per Jan's comments, realized that we 
don't log the lag time in the per-transaction log file. I think that's a 
serious omission; when --rate is used, the schedule lag time is 
important information to make sense of the result. I think we have to 
apply the attached patch, and backpatch to 9.4. (The documentation on 
the log file format needs updating)


Also, this is bizarre:


/*
 * Use inverse transform sampling to randomly generate a delay, 
such
 * that the series of delays will approximate a Poisson 
distribution
 * centered on the throttle_delay time.
 *
 * 1 implies a 9.2 (-log(1/1)) to 0.0 (log 1) delay
 * multiplier, and results in a 0.055 % target underestimation 
bias:
 *
 * SELECT 1.0/AVG(-LN(i/1.0)) FROM generate_series(1,1) 
AS i;
 * = 1.00055271703266335474
 *
 * If transactions are too slow or a given wait is shorter than 
a
 * transaction, the next transaction will start right away.
 */
int64   wait = (int64) (throttle_delay *
  1.00055271703 * -log(getrand(thread, 1, 
1) / 1.0));


We're using getrand() to generate a uniformly distributed random value 
between 1 and 1, and then convert it to a double between 0.0 and 
1.0. But getrand() is implemented by taking a double between 0.0 and 1.0 
and converting it to an integer, so we're just truncating the original 
floating point number unnecessarily.  I think we should add a new 
function, getPoissonRand(), that uses pg_erand48() directly. We already 
have similar getGaussianRand() and getExponentialRand() functions. 
Barring objections, I'll prepare another patch to do that, and backpatch 
to 9.4.
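
A standalone sketch of what that could look like (drand48() stands in
for pg_erand48() so the example compiles on its own; the name and the
details are assumptions, not the actual follow-up patch). Note that no
bias-correction factor is needed when the uniform double is sampled
directly.

#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Return a delay that is exponentially distributed with mean 'center',
 * so the sequence of transaction start times approximates a Poisson
 * process with rate 1/center. */
static int64_t
getPoissonRand(int64_t center)
{
    /* uniform in (0, 1]; never exactly 0, so log() is safe */
    double      uniform = 1.0 - drand48();

    return (int64_t) (-log(uniform) * (double) center);
}

int
main(void)
{
    int64_t     sum = 0;
    int         i;

    srand48(42);
    for (i = 0; i < 1000000; i++)
        sum += getPoissonRand(1000);    /* 1000 usec target average */
    printf("average delay: %.1f usec\n", (double) sum / 1000000.0);
    return 0;
}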

- Heikki

diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 2f7d80e..f7a3a5f 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -210,10 +210,10 @@ typedef struct
  * sent */
 	int			sleeping;		/* 1 indicates that the client is napping */
 	bool		throttling;		/* whether nap is for throttling */
-	int64		until;			/* napping until (usec) */
 	Variable   *variables;		/* array of variable definitions */
 	int			nvariables;
-	instr_time	txn_begin;		/* used for measuring transaction latencies */
+	int64		txn_scheduled;	/* scheduled start time of transaction (usec) */
+	instr_time	txn_begin;		/* used for measuring schedule lag times */
 	instr_time	stmt_begin;		/* used for measuring statement latencies */
 	int64		txn_latencies;	/* cumulated latencies */
 	int64		txn_sqlats;		/* cumulated square latencies */
@@ -284,12 +284,17 @@ typedef struct
 
 	long		start_time;		/* when does the interval start */
 	int			cnt;			/* number of transactions */
-	double		min_duration;	/* min/max durations */
-	double		max_duration;
-	double		sum;			/* sum(duration), sum(duration^2) - for
+
+	double		min_latency;	/* min/max latencies */
+	double		max_latency;
+	double		sum_latency;	/* sum(latency), sum(latency^2) - for
  * estimates */
-	double		sum2;
+	double		sum2_latency;
 
+	double		min_lag;
+	double		max_lag;
+	double		sum_lag;		/* sum(lag) */
+	double		sum2_lag;		/* sum(lag*lag) */
 } AggVals;
 
 static Command **sql_files[MAX_FILES];	/* SQL script files */
@@ -968,12 +973,18 @@ agg_vals_init(AggVals *aggs, instr_time start)
 {
 	/* basic counters */
 	aggs->cnt = 0;/* number of transactions */
-	aggs->sum = 0;/* SUM(duration) */
-	aggs->sum2 = 0;/* SUM(duration*duration) */
+	aggs->sum_latency = 0;		/* SUM(latency) */
+	aggs->sum2_latency = 0;/* SUM(latency*latency) */
 
 	/* min and max transaction duration */
-	aggs->min_duration = 0;
-	aggs->max_duration = 0;
+	aggs->min_latency = 0;
+	aggs->max_latency = 0;
+
+	/* schedule lag counters */
+	aggs->sum_lag = 0;
+	aggs->sum2_lag = 0;
+	aggs->min_lag = 0;
+	aggs->max_lag = 0;
 
 	/* start of the current interval */
 	aggs->start_time = INSTR_TIME_GET_DOUBLE(start);
@@ -1016,7 +1027,7 @@ top:
 
 		thread->throttle_trigger += wait;
 
-		st->until = thread->throttle_trigger;
+		st->txn_scheduled = thread->throttle_trigger;
 		st->sleeping = 1;
 		st->throttling = true;
 		st->is_throttled = true;
@@ -1032,13 +1043,13 @@ top:
 
 		INSTR_TIME_SET_CURRENT(now);
 		now_us = INSTR_TIME_GET_MICROSEC(now);
-		if (st->until <= now_us)
+		if (st->txn_scheduled <= 

Re: [HACKERS] SKIP LOCKED DATA (work in progress)

2014-09-10 Thread Alvaro Herrera
Thomas Munro wrote:

> @@ -2022,7 +2030,7 @@ EvalPlanQualFetch(EState *estate, Relation relation, 
> int lockmode, bool noWait,
>*/
>   test = heap_lock_tuple(relation, &tuple,
>  
> estate->es_output_cid,
> -lockmode, 
> noWait,
> +lockmode, 
> wait_policy,
>  false, 
> &buffer, &hufd);
>   /* We now have two pins on the buffer, get rid of one */
>   ReleaseBuffer(buffer);

Doesn't this heap_lock_tuple() need to check for a WouldBlock result?
Not sure that this is the same case that you were trying to test in
heap_lock_updated_tuple; I think this requires an update chain (so that
EPQFetch is invoked) and some tuple later in the chain is locked by a
third transaction.

I attach some additional minor suggestions to your patch.  Please feel
free to reword comments differently if you think my wording isn't an
improvement (or I've made English mistakes).

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: PENDING_LIST_CLEANUP_SIZE - maximum size of GIN pending list Re: [HACKERS] HEAD seems to generate larger WAL regarding GIN index

2014-09-10 Thread Alvaro Herrera
Fujii Masao wrote:
> On Wed, Sep 10, 2014 at 12:15 PM, Etsuro Fujita
>  wrote:

> > PENDING_LIST_CLEANUP_SIZE and work_mem, for this setting.
> > Wouldn't it be easy-to-use to have only one parameter,
> > PENDING_LIST_CLEANUP_SIZE?  How about setting PENDING_LIST_CLEANUP_SIZE to
> > work_mem as the default value when running the CREATE INDEX command?
> 
> That's an idea. But there might be some users who want to change
> the cleanup size per session like they can do by setting work_mem,
> and your idea would prevent them from doing that...
> 
> So what about introducing pending_list_cleanup_size also as a GUC?
> That is, users can set the threshold by using either the reloption or
> GUC.

Yes, I think having both a GUC and a reloption makes sense -- the GUC
applies to all indexes, and can be tweaked for individual indexes using
the reloption.

I'm not sure about the idea of being able to change it per session,
though.  Do you mean that you would like insert processes to use a very
large value so that they can just append new values to the pending list,
and have vacuum use a small value so that it cleans up as soon as it
runs?  Two things: 1. we could have an "autovacuum_" reloption which
only changes what autovacuum does; 2. we could have autovacuum run
index cleanup actions separately from actual vacuuming.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgcrypto: PGP armor headers

2014-09-10 Thread Marko Tiikkaja

On 9/10/14 1:38 PM, Heikki Linnakangas wrote:

On 09/10/2014 02:26 PM, Marko Tiikkaja wrote:

So I wonder if I shouldn't try and instead keep the
code closer to what it is in HEAD right now; I could call
enlargeStringInfo() first, then hand out a pointer to b64_encode (or
b64_decode()) and finally increment StringInfoData.len by how much was
actually written.  That would keep the code changes a lot smaller, too.


Yeah, that sounds reasonable.


OK, I've attempted to do that in the attached.  I'm pretty sure I didn't 
get all of the overflows right, so someone should probably take a really 
good look at it.  (I'm not too confident the original code got them 
right either, but whatever).


Speaking of good looks, should I add it to the next commitfest as a new 
patch, or should we try and get someone to review it like this?



I'm also not sure why we need to keep a copy of the base64
encoding/decoding logic instead of exporting it in utils/adt/encode.c.


Good question...


I've left this question unanswered for now.  We can fix it later, 
independently of this patch.



.marko
*** a/contrib/pgcrypto/pgp-armor.c
--- b/contrib/pgcrypto/pgp-armor.c
***
*** 203,240  crc24(const uint8 *data, unsigned len)
return crc & 0xffL;
  }
  
! int
! pgp_armor_encode(const uint8 *src, unsigned len, uint8 *dst)
  {
!   int n;
!   uint8  *pos = dst;
unsignedcrc = crc24(src, len);
  
!   n = strlen(armor_header);
!   memcpy(pos, armor_header, n);
!   pos += n;
  
!   n = b64_encode(src, len, pos);
!   pos += n;
  
!   if (*(pos - 1) != '\n')
!   *pos++ = '\n';
  
!   *pos++ = '=';
!   pos[3] = _base64[crc & 0x3f];
!   crc >>= 6;
!   pos[2] = _base64[crc & 0x3f];
!   crc >>= 6;
!   pos[1] = _base64[crc & 0x3f];
!   crc >>= 6;
!   pos[0] = _base64[crc & 0x3f];
!   pos += 4;
  
!   n = strlen(armor_footer);
!   memcpy(pos, armor_footer, n);
!   pos += n;
! 
!   return pos - dst;
  }
  
  static const uint8 *
--- 203,239 
return crc & 0xffL;
  }
  
! void
! pgp_armor_encode(const uint8 *src, int len, StringInfo dst)
  {
!   int res;
!   unsignedb64len;
unsignedcrc = crc24(src, len);
  
!   appendStringInfoString(dst, armor_header);
  
!   /* make sure we have enough room to b64_encode() */
!   b64len = b64_enc_len(len);
!   if (b64len > INT_MAX)
!   ereport(ERROR,
!   (errcode(ERRCODE_OUT_OF_MEMORY),
!errmsg("out of memory")));
!   enlargeStringInfo(dst, (int) b64len);
!   res = b64_encode(src, len, (uint8 *) dst->data + dst->len);
!   if (res > b64len)
!   elog(FATAL, "overflow - encode estimate too small");
!   dst->len += res;
  
!   if (*(dst->data + dst->len - 1) != '\n')
!   appendStringInfoChar(dst, '\n');
  
!   appendStringInfoChar(dst, '=');
!   appendStringInfoChar(dst, _base64[(crc >> 18) & 0x3f]);
!   appendStringInfoChar(dst, _base64[(crc >> 12) & 0x3f]);
!   appendStringInfoChar(dst, _base64[(crc >> 6) & 0x3f]);
!   appendStringInfoChar(dst, _base64[crc & 0x3f]);
  
!   appendStringInfoString(dst, armor_footer);
  }
  
  static const uint8 *
***
*** 309,315  find_header(const uint8 *data, const uint8 *datend,
  }
  
  int
! pgp_armor_decode(const uint8 *src, unsigned len, uint8 *dst)
  {
const uint8 *p = src;
const uint8 *data_end = src + len;
--- 308,314 
  }
  
  int
! pgp_armor_decode(const uint8 *src, int len, StringInfo dst)
  {
const uint8 *p = src;
const uint8 *data_end = src + len;
***
*** 319,324  pgp_armor_decode(const uint8 *src, unsigned len, uint8 *dst)
--- 318,324 
const uint8 *base64_end = NULL;
uint8   buf[4];
int hlen;
+   int blen;
int res = PXE_PGP_CORRUPT_ARMOR;
  
/* armor start */
***
*** 360,382  pgp_armor_decode(const uint8 *src, unsigned len, uint8 *dst)
crc = (((long) buf[0]) << 16) + (((long) buf[1]) << 8) + (long) buf[2];
  
/* decode data */
!   res = b64_decode(base64_start, base64_end - base64_start, dst);
! 
!   /* check crc */
!   if (res >= 0 && crc24(dst, res) != crc)
!   res = PXE_PGP_CORRUPT_ARMOR;
  out:
return res;
  }
- 
- unsigned
- pgp_armor_enc_len(unsigned len)
- {
-   return b64_enc_len(len) + strlen(armor_header) + strlen(armor_footer) + 
16;
- }
- 
- unsigned
- pgp_armor_dec_len(unsigned len)
- {
-   return b64_dec_len(len);
- }
--- 360,377 
crc = (((long) buf[0]) << 16) + (((long) buf[1]) << 8) + (long) buf[2];
  
/* decode data */
!   blen = (int) b64_dec_len(len);
!   enlargeStringInfo(dst

Re: [HACKERS] ALTER SYSTEM RESET?

2014-09-10 Thread Fujii Masao
On Wed, Sep 3, 2014 at 12:08 AM, Christoph Berg  wrote:
> Re: Vik Fearing 2014-09-02 <5405d2d9.9050...@dalibo.com>
>> > Uhm, are we agreed on the decision on not to backpatch this?  I would
>> > think this should have been part of the initial ALTER SYSTEM SET patch
>> > and thus should be backpatched to 9.4.
>>
>> I think it belongs in 9.4 as well, especially if we're having another beta.
>
> My original complaint was about 9.4, so I'd like to see it there, yes.

ISTM that the consensus is to do the back-patch to 9.4.
Does anyone object to the back-patch? If not, I will do that.

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgcrypto: PGP armor headers

2014-09-10 Thread Heikki Linnakangas

On 09/10/2014 02:26 PM, Marko Tiikkaja wrote:

On 9/9/14 11:01 AM, I wrote:

On 9/9/14 10:54 AM, Heikki Linnakangas wrote:

So I think this (and the corresponding dearmor code too) should be
refactored to use a StringInfo that gets enlarged as needed, instead of
hoping to guess the size correctly beforehand. To ease review, might
make sense to do that as a separate patch over current sources, and the
main patch on top of that.


Yeah, I thought the same thing while writing that code.  Thanks, I'll do
it that way.


Attached is what I have right now.  I started working on the decoding
part, but it has this piece of code:

 /* decode crc */
 if (b64_decode(p + 1, 4, buf) != 3)
 goto out;

which makes the approach a bit uglier.  If I did this the same way, I
would have to create and destroy a StringInfo just for this operation,
which seems ugly.  So I wonder if I shouldn't try and instead keep the
code closer to what it is in HEAD right now; I could call
enlargeStringInfo() first, then hand out a pointer to b64_encode (or
b64_decode()) and finally increment StringInfoData.len by how much was
actually written.  That would keep the code changes a lot smaller, too.


Yeah, that sounds reasonable.
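
For illustration, a minimal sketch of that pattern (the helper name is made up here; this is not the actual patch) could look like:

/*
 * Sketch only: reserve worst-case space in the StringInfo, let b64_encode()
 * write directly into its buffer, then advance ->len by what was actually
 * written.
 */
static void
append_b64(StringInfo dst, const uint8 *src, int srclen)
{
    int     needed = b64_enc_len(srclen);   /* worst-case output size */
    int     written;

    enlargeStringInfo(dst, needed);         /* may repalloc dst->data */
    written = b64_encode(src, srclen, (uint8 *) dst->data + dst->len);
    if (written > needed)
        elog(ERROR, "base64 encode estimate too small");
    dst->len += written;
    dst->data[dst->len] = '\0';             /* keep the buffer NUL-terminated */
}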


I'm also not sure why we need to keep a copy of the base64
encoding/decoding logic instead of exporting it in utils/adt/encode.c.


Good question...

- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgcrypto: PGP armor headers

2014-09-10 Thread Marko Tiikkaja

On 9/9/14 11:01 AM, I wrote:

On 9/9/14 10:54 AM, Heikki Linnakangas wrote:

So I think this (and the corresponding dearmor code too) should be
refactored to use a StringInfo that gets enlarged as needed, instead of
hoping to guess the size correctly beforehand. To ease review, might
make sense to do that as a separate patch over current sources, and the
main patch on top of that.


Yeah, I thought the same thing while writing that code.  Thanks, I'll do
it that way.


Attached is what I have right now.  I started working on the decoding 
part, but it has this piece of code:


   /* decode crc */
   if (b64_decode(p + 1, 4, buf) != 3)
   goto out;

which makes the approach a bit uglier.  If I did this the same way, I 
would have to create and destroy a StringInfo just for this operation, 
which seems ugly.  So I wonder if I shouldn't try and instead keep the 
code closer to what it is in HEAD right now; I could call 
enlargeStringInfo() first, then hand out a pointer to b64_encode (or 
b64_decode()) and finally increment StringInfoData.len by how much was 
actually written.  That would keep the code changes a lot smaller, too.


Is either of these approaches anywhere near what you had in mind?

I'm also not sure why we need to keep a copy of the base64 
encoding/decoding logic instead of exporting it in utils/adt/encode.c.



.marko
*** a/contrib/pgcrypto/pgp-armor.c
--- b/contrib/pgcrypto/pgp-armor.c
***
*** 31,36 
--- 31,38 
  
  #include "postgres.h"
  
+ #include "lib/stringinfo.h"
+ 
  #include "px.h"
  #include "pgp.h"
  
***
*** 38,58 
   * BASE64 - duplicated :(
   */
  
! static const unsigned char _base64[] =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
  
! static int
! b64_encode(const uint8 *src, unsigned len, uint8 *dst)
  {
-   uint8  *p,
-  *lend = dst + 76;
const uint8 *s,
   *end = src + len;
int pos = 2;
unsigned long buf = 0;
  
s = src;
-   p = dst;
  
while (s < end)
{
--- 40,58 
   * BASE64 - duplicated :(
   */
  
! static const char _base64[] =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
  
! static void
! b64_encode(const uint8 *src, unsigned len, StringInfo dst)
  {
const uint8 *s,
   *end = src + len;
int pos = 2;
unsigned long buf = 0;
+   int linepos = 0;
  
s = src;
  
while (s < end)
{
***
*** 65,93  b64_encode(const uint8 *src, unsigned len, uint8 *dst)
 */
if (pos < 0)
{
!   *p++ = _base64[(buf >> 18) & 0x3f];
!   *p++ = _base64[(buf >> 12) & 0x3f];
!   *p++ = _base64[(buf >> 6) & 0x3f];
!   *p++ = _base64[buf & 0x3f];
  
pos = 2;
buf = 0;
}
!   if (p >= lend)
{
!   *p++ = '\n';
!   lend = p + 76;
}
}
if (pos != 2)
{
!   *p++ = _base64[(buf >> 18) & 0x3f];
!   *p++ = _base64[(buf >> 12) & 0x3f];
!   *p++ = (pos == 0) ? _base64[(buf >> 6) & 0x3f] : '=';
!   *p++ = '=';
}
- 
-   return p - dst;
  }
  
  /* probably should use lookup table */
--- 65,92 
 */
if (pos < 0)
{
!   appendStringInfoCharMacro(dst, _base64[(buf >> 18) & 0x3f]);
!   appendStringInfoCharMacro(dst, _base64[(buf >> 12) & 0x3f]);
!   appendStringInfoCharMacro(dst, _base64[(buf >> 6) & 0x3f]);
!   appendStringInfoCharMacro(dst, _base64[buf & 0x3f]);
  
+   linepos += 4;
pos = 2;
buf = 0;
}
!   if (linepos >= 76)
{
!   appendStringInfoCharMacro(dst, '\n');
!   linepos = 0;
}
}
if (pos != 2)
{
!   appendStringInfoCharMacro(dst, _base64[(buf >> 18) & 0x3f]);
!   appendStringInfoCharMacro(dst, _base64[(buf >> 12) & 0x3f]);
!   appendStringInfoCharMacro(dst, (pos == 0) ? _base64[(buf >> 6) & 0x3f] : '=');
!   appendStringInfoCharMacro(dst, '=');
}
  }
  
  /* probably should use lookup table */
***
*** 160,174  b64_decode(const uint8 *src, unsigned len, uint8 *dst)
  }
  
  static unsigned
- b64_enc_len(unsigned srclen)
- {
-   /*
-* 3 bytes will be converted to 4, linefeed after 76 chars
-*/
-   return (srclen + 2) * 4 / 3 + srclen / (76 * 3 / 4);
- }
- 
- static unsigned
  b64_dec_len(unsigne

Re: [HACKERS] Add shutdown_at_recovery_target option to recovery.conf

2014-09-10 Thread Fujii Masao
On Wed, Sep 10, 2014 at 1:45 AM, Petr Jelinek  wrote:
> Hi,
>
> I recently wanted several times to have slave server prepared at certain
> point in time to reduce the time it takes for it to replay remaining WALs
> (say I have pg_basebackup -x on busy db for example).

In your example, you're thinking of performing the recovery after taking
the backup and stopping it at the consistent point (i.e., end of backup) by
using the proposed feature? Then you're expecting that a future recovery
will start from that consistent point, which will reduce the recovery time?

This is true if a checkpoint is executed at the end of the backup. But there
might be no checkpoint at all during the backup. In that case, even a future
recovery would need to start from the very beginning of the backup. That is,
we cannot reduce the recovery time. So, for your purpose, you might also
need to add a new option to pg_basebackup so that a checkpoint is executed
at the end of the backup when the option is set.

> Currently the way to do it is to have pause_at_recovery_target true
> (default) and wait until pg accepts connection and the shut it down. The
> issue is that this is ugly, and also there is a chance that somebody else
> connects and does bad things (tm) before my process does.
>
> So I wrote simple patch that adds option to shut down the cluster once
> recovery_target is reached. The server will still be able to continue WAL
> replay if needed later or can be configured to start standalone.

What about adding something like action_at_recovery_target=pause|shutdown
instead of increasing the number of parameters?
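
For example, a recovery.conf under that scheme might look like this (the parameter name and values are only the proposal above, not an existing setting):

# hypothetical sketch of the consolidated option
recovery_target_time = '2014-09-10 12:00:00'
action_at_recovery_target = 'shutdown'    # or 'pause'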

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] replication commands and log_statements

2014-09-10 Thread Fujii Masao
Thanks for reviewing the patch!

On Wed, Sep 10, 2014 at 4:57 PM, Heikki Linnakangas
 wrote:
> On 08/28/2014 11:38 AM, Fujii Masao wrote:
>>
>> On Thu, Jun 19, 2014 at 5:29 PM, Ian Barwick  wrote:
>>>
>>> - minor rewording for the description, mentioning that statements will
>>>still be logged as DEBUG1 even if parameter set to 'off' (might
>>>prevent reports of the kind "I set it to 'off', why am I still seeing
>>>log entries?").
>>>
>>> 
>>>  Causes each replication command to be logged in the server log.
>>>  See  for more information
>>> about
>>>  replication commands. The default value is off. When
>>> set
>>> to
>>>  off, commands will be logged at log level
>>> DEBUG1.
>>>  Only superusers can change this setting.
>>> 
>>
>>
>> Yep, fixed. Attached is the updated version of the patch.
>
>
> I don't think it's necessary to mention that the commands will still be
> logged at DEBUG1 level. We log all kinds of crap at the various DEBUG
> levels, and they're not supposed to be used in normal operation.

Agreed. I removed that mention from the document.

>
>>> - I feel it would be more consistent to use the plural form
>>>for this parameter, i.e. "log_replication_commands", in line with
>>>"log_lock_waits", "log_temp_files", "log_disconnections" etc.
>>
>>
>> But log_statement is in the singular form. So I just used
>> log_replication_command. For the consistency, maybe we need to
>> change both parameters in the plural form? I don't have strong
>> opinion about this.
>
>
> Yeah, we seem to be inconsistent. log_replication_commands would sound
> better to me in isolation, but then there is log_statement..

Agreed. I changed the GUC name to log_replication_commands.

>
> I'll mark this as Ready for Committer in the commitfest app (I assume you'll
> take care of committing this yourself when ready).

Attached is the updated version of the patch. After at least one day
I will commit the patch.

Regards,

-- 
Fujii Masao
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***
*** 4558,4563  FROM pg_stat_activity;
--- 4558,4579 

   
  
+  
+   log_replication_commands (boolean)
+   
+log_replication_commands configuration parameter
+   
+   
+   
+
+ Causes each replication command to be logged in the server log.
+ See  for more information about
+ replication command. The default value is off.
+ Only superusers can change this setting.
+
+   
+  
+ 
   
log_temp_files (integer)

*** a/doc/src/sgml/protocol.sgml
--- b/doc/src/sgml/protocol.sgml
***
*** 1306,1311  To initiate streaming replication, the frontend sends the
--- 1306,1313 
  of true tells the backend to go into walsender mode, wherein a
  small set of replication commands can be issued instead of SQL statements. Only
  the simple query protocol can be used in walsender mode.
+ Replication commands are logged in the server log when
+  is enabled.
  Passing database as the value instructs walsender to connect to
  the database specified in the dbname parameter, which will allow
  the connection to be used for logical replication from that database.
*** a/src/backend/replication/walsender.c
--- b/src/backend/replication/walsender.c
***
*** 108,113  bool		am_db_walsender = false;	/* Connected to a database? */
--- 108,114 
  int			max_wal_senders = 0;	/* the maximum number of concurrent walsenders */
  int			wal_sender_timeout = 60 * 1000;		/* maximum time to send one
   * WAL data message */
+ bool		log_replication_commands = false;
  
  /*
   * State for WalSndWakeupRequest
***
*** 1268,1280  exec_replication_command(const char *cmd_string)
  	MemoryContext old_context;
  
  	/*
  	 * CREATE_REPLICATION_SLOT ... LOGICAL exports a snapshot until the next
  	 * command arrives. Clean up the old stuff if there's anything.
  	 */
  	SnapBuildClearExportedSnapshot();
  
- 	elog(DEBUG1, "received replication command: %s", cmd_string);
- 
  	CHECK_FOR_INTERRUPTS();
  
  	cmd_context = AllocSetContextCreate(CurrentMemoryContext,
--- 1269,1287 
  	MemoryContext old_context;
  
  	/*
+ 	 * Log replication command if log_replication_commands is enabled.
+ 	 * Even when it's disabled, log the command with DEBUG1 level for
+ 	 * backward compatibility.
+ 	 */
+ 	ereport(log_replication_commands ? LOG : DEBUG1,
+ 			(errmsg("received replication command: %s", cmd_string)));
+ 
+ 	/*
  	 * CREATE_REPLICATION_SLOT ... LOGICAL exports a snapshot until the next
  	 * command arrives. Clean up the old stuff if there's anything.
  	 */
  	SnapBuildClearExportedSnapshot();
  
  	CHECK_FOR_INTERRUPTS();
  
  	cmd_context = AllocSetContextCreate(CurrentMemoryContext,
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***
*** 925,930  static str

Re: [HACKERS] LIMIT for UPDATE and DELETE

2014-09-10 Thread Marko Tiikkaja

On 9/10/14 12:05 PM, Etsuro Fujita wrote:

(2014/09/10 18:33), Marko Tiikkaja wrote:

You can already change *this patch* to do ModifyTable <- Limit <-
LockRows.  If we think that's what we want, we should rewrite this patch
right now.


I think it might be relatively easy to do that for single-table cases,
but for inheritance cases, inheritance_planner needs work and I'm not
sure how much work it would take ...


Uh.  Yeah, I think I'm an idiot and you're right.

I'll try and get some benchmarking done and see what comes out.


.marko


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] LIMIT for UPDATE and DELETE

2014-09-10 Thread Etsuro Fujita

(2014/09/10 18:33), Marko Tiikkaja wrote:

On 9/10/14 11:25 AM, Etsuro Fujita wrote:

The reason is because I think that after implementing #2, we should
re-implement this feature by extending the planner to produce a plan
tree such as ModifyTable+Limit+Append, maybe with LockRows below the
Limit node.  Such an approach would prevent duplication of the LIMIT
code in ModifyTable, making the ModifyTable code more simple, I think.



You can already change *this patch* to do ModifyTable <- Limit <-
LockRows.  If we think that's what we want, we should rewrite this patch
right now.


I think it might be relatively easy to do that for single-table cases, 
but for inheritance cases, inheritance_planner needs work and I'm not 
sure how much work it would take ...



Like I said upthread, I think LockRows is a bad idea, but I'll need to
run some performance tests first.  But whichever method we decide to
implement for this patch shouldn't need to be touched when the changes
to UPDATE land, so your reasoning is incorrect.


Yeah, as you say, we need the performance tests, and I think it would 
depend on those results whether LockRows is a bad idea or not.


Thanks,

Best regards,
Etsuro Fujita


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: PENDING_LIST_CLEANUP_SIZE - maximum size of GIN pending list Re: [HACKERS] HEAD seems to generate larger WAL regarding GIN index

2014-09-10 Thread Fujii Masao
On Wed, Sep 10, 2014 at 5:35 PM, Etsuro Fujita
 wrote:
> (2014/09/10 12:31), Fujii Masao wrote:
>>
>> On Wed, Sep 10, 2014 at 12:15 PM, Etsuro Fujita
>>  wrote:
>>>
>>> (2014/09/09 22:17), Fujii Masao wrote:

 Attached is the updated version of the patch.
>
>
>>> I took a quick review on the patch.  It looks good to me,
>
>
>>> but one thing I'm
>>> concerned about is
>>>
>>> You wrote:
>>>
>>> The attached patch introduces the GIN index storage parameter
>>> "PENDING_LIST_CLEANUP_SIZE" which specifies the maximum size of
>>> GIN pending list. If it's not set, work_mem is used as that maximum
>>> size,
>>> instead. So this patch doesn't break the existing application which
>>> currently uses work_mem as the threshold of cleanup operation of
>>> the pending list. It has only not to set PENDING_LIST_CLEANUP_SIZE.
>>>
>>>
>>> As you mentioned, I think it's important to consider for the existing
>>> applications, but I'm wondering if it would be a bit confusing users to
>>> have
>>> two parameters,
>>
>>
>> Yep.
>>
>>> PENDING_LIST_CLEANUP_SIZE and work_mem, for this setting.
>>> Wouldn't it be easy-to-use to have only one parameter,
>>> PENDING_LIST_CLEANUP_SIZE?  How about setting PENDING_LIST_CLEANUP_SIZE
>>> to
>>> work_mem as the default value when running the CREATE INDEX command?
>>
>>
>> That's an idea. But there might be some users who want to change
>> the cleanup size per session like they can do by setting work_mem,
>> and your idea would prevent them from doing that...
>
>
> Why not use ALTER INDEX ... SET (PENDING_LIST_CLEANUP_SIZE= ...)?  Maybe I'm
> missing something, though.

It takes an AccessExclusive lock and affects every session
(not only the specified one).
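
To make the trade-off concrete, a hypothetical usage sketch (the reloption name follows the patch, the GUC only the suggestion in this thread; exact units and syntax may differ, and gin_idx is a placeholder index name):

-- reloption: takes AccessExclusiveLock and affects all sessions
ALTER INDEX gin_idx SET (pending_list_cleanup_size = 4096);
-- per-session setting, if such a GUC were added
SET pending_list_cleanup_size = '4MB';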

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] LIMIT for UPDATE and DELETE

2014-09-10 Thread Marko Tiikkaja

On 9/10/14 11:25 AM, Etsuro Fujita wrote:

The reason is because I think that after implementing #2, we should
re-implement this feature by extending the planner to produce a plan
tree such as ModifyTable+Limit+Append, maybe with LockRows below the
Limit node.  Such an approach would prevent duplication of the LIMIT
code in ModifyTable, making the ModifyTable code more simple, I think.


You can already change *this patch* to do ModifyTable <- Limit <- 
LockRows.  If we think that's what we want, we should rewrite this patch 
right now.  This isn't a reason not to implement LIMIT without ORDER BY.


Like I said upthread, I think LockRows is a bad idea, but I'll need to 
run some performance tests first.  But whichever method we decide to 
implement for this patch shouldn't need to be touched when the changes 
to UPDATE land, so your reasoning is incorrect.



.marko


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [TODO] Process pg_hba.conf keywords as case-insensitive

2014-09-10 Thread Florian Pflug
On Sep10, 2014, at 10:54 , Kyotaro HORIGUCHI  
wrote:
> Under the new specifications, next_token will work as following,
> 
>  - USER  : token: USER  , case-insensitive
>  - "USeR": token: USeR  , case-SENSITIVE
>  - "+uSeR"   : token: +uSeR , case-SENSITIVE
>  - "+UsE"R   : token: +UsEr , case-insensitive
>  - U"S""e"R  : token: US"eR , case-insensitive
> 
>  - +USER : token: USER  , case-insensitive, group_name
>  - +"uSeR"   : token: uSeR  , case_SENSITIVE, group_name
>  - +U"sE""r" : token: UsE"r , case-insensitive, group_name
> 
>  - + : token: + , (useless?)
>  - @ : token: @ , (useless?)
>  - @ho"ge: token: ho"ge, file_inclusion (not confirmed)
> 
> 
> There's a concern that Case-insensitive matching is accomplished
> by full-scan on pg_database or pg_authid so it would be rather
> slow than case-sensitive matching. This might not be acceptable
> by the community.

That does indeed sound bad. Couldn't we handle this the same
way we handle SQL identifiers, i.e. simply downcase unquoted
identifiers, and then compare case-sensitively?

So foo, Foo and FOO would all match the user called <foo>,
but "Foo" would match the user called <Foo>, and "FOO" the
user called <FOO>.

An unquoted "+" would cause whatever follows it to be interpreted
as a group name, whereas a quoted "+" would simply become part of
the user name (or group name, if there's an additional unquoted
"+" before it).

So +foo would refer to the group <foo>, +"FOO" to the group <FOO>,
and +"+A" to the group <+A>.

I haven't checked if such an approach would be sufficiently
backwards-compatible, though.
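
A minimal sketch of that scheme (illustrative only, ASCII-only folding as for unquoted SQL identifiers; not actual pg_hba.c code):

/*
 * Sketch: fold unquoted tokens to lower case once at parse time, then
 * match case-sensitively with plain strcmp().  Quoted tokens keep their
 * case, so no catalog full scan is needed.
 */
static void
downcase_unquoted_token(char *tok)
{
    for (; *tok; tok++)
    {
        if (*tok >= 'A' && *tok <= 'Z')
            *tok += 'a' - 'A';
    }
}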

best regards,
Florian Pflug



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [TODO] Process pg_hba.conf keywords as case-insensitive

2014-09-10 Thread Kyotaro HORIGUCHI
Hmm...

Case-insensitive matching could produce multiple matches, which should be an
error, but I forgot to handle that.

regards,

2014/09/10 17:54 "Kyotaro HORIGUCHI" :

> And one known defect is that you will get a bit odd message if
> you put an hba line having keywords quoted or prefixed with '+',
> for example
>
> +locAl   "postgres"   +sUstRust
>
> The server complains for the line above that
>
> *| LOG:  invalid connection type "locAl"
>  | CONTEXT:  line 84 of configuration file
"/home/horiguti/data/data_work/pg_hba.conf"
>
> The prefixing '+' is omitted. To correct this, either deparsing
> token into original string or storing original string into tokens
> is needed, I think.
>
> What do you think about the changes, Viswanatham or all ?

--
Kyotaro Horiguchi
NTT Open Source Software Center


Re: [HACKERS] LIMIT for UPDATE and DELETE

2014-09-10 Thread Etsuro Fujita

(2014/09/10 16:57), Marko Tiikkaja wrote:

On 2014-09-10 04:25, Etsuro Fujita wrote:

(2014/09/09 18:57), Heikki Linnakangas wrote:

What's not clear to me is whether it make sense to do 1) without 2) ? Is
UPDATE .. LIMIT without support for an ORDER BY useful enough? And if we
apply this patch now, how much of it needs to be rewritten after 2) ? If
the answers are "yes" and "not much", then we should review this patch
now, and put 2) on the TODO list. Otherwise 2) should do done first.


My answers are "yes" but "completely rewritten".


Any particular reason for you to say that?  Because an UPDATE might have
a RETURNING clause, all the updated tuples have to go through the
ModifyTable node one at a time.  I don't see why we couldn't LIMIT there
after implementing #2.


The reason is because I think that after implementing #2, we should 
re-implement this feature by extending the planner to produce a plan 
tree such as ModifyTable+Limit+Append, maybe with LockRows below the 
Limit node.  Such an approach would prevent duplication of the LIMIT 
code in ModifyTable, making the ModifyTable code more simple, I think.
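
Schematically, the plan shape being discussed would be something like this (illustrative only, not actual EXPLAIN output):

Update on parent        (ModifyTable)
  ->  Limit
        ->  LockRows
              ->  Append
                    ->  Seq Scan on child1
                    ->  Seq Scan on child2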


Thanks,

Best regards,
Etsuro Fujita


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [TODO] Process pg_hba.conf keywords as case-insensitive

2014-09-10 Thread Kyotaro HORIGUCHI
Hello, I had a closer look on this patch.

Finally, I think that we need case-insensitive versions of
get_role_oid() and get_database_oid() to accomplish this patch's
objective. (These run full scans on pg_database or pg_authid.)

And I'd like to propose changing the token categorization from
notation-based to treatment-based. Concretely, this patch
categorizes tokens by 'a special quote is used' and 'quoted from
the first character', but it seems to make the logic clearer to
categorize them by 'case-sensitive or not' and 'represents a group name'.

The attached patch is a revised version of your original patch,
reworked along the above point. (Apologies in advance that this is a
quick hack; in particular, the code related to file inclusion is not
tested at all.)

I have tested this only at a superficial level, but it seems to work as
expected.

Under the new specifications, next_token will work as following,

  - USER  : token: USER  , case-insensitive
  - "USeR": token: USeR  , case-SENSITIVE
  - "+uSeR"   : token: +uSeR , case-SENSITIVE
  - "+UsE"R   : token: +UsEr , case-insensitive
  - U"S""e"R  : token: US"eR , case-insensitive

  - +USER : token: USER  , case-insensitive, group_name
  - +"uSeR"   : token: uSeR  , case_SENSITIVE, group_name
  - +U"sE""r" : token: UsE"r , case-insensitive, group_name

  - + : token: + , (useless?)
  - @ : token: @ , (useless?)
  - @ho"ge: token: ho"ge, file_inclusion (not confirmed)


There's a concern that case-insensitive matching is accomplished
by a full scan of pg_database or pg_authid, so it would be slower
than case-sensitive matching. This might not be acceptable
to the community.

And one known defect is that you will get a somewhat odd message if
you put an hba line with keywords quoted or prefixed with '+',
for example

+locAl   "postgres"   +sUstRust

The server complains for the line above that 

*| LOG:  invalid connection type "locAl"
 | CONTEXT:  line 84 of configuration file 
"/home/horiguti/data/data_work/pg_hba.conf"

The '+' prefix is omitted. To correct this, we would need either to
deparse the token back into the original string or to store the
original string in the token, I think.

What do you think about the changes, Viswanatham or all ?

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index f480be8..db73dd9 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -1991,6 +1991,50 @@ get_database_oid(const char *dbname, bool missing_ok)
 	return oid;
 }
 
+/*
+ * get_database_oid_case_insensitive - given a database name, look up the
+ * OID in a case-insensitive manner.
+ *
+ * If missing_ok is false, throw an error if database name not found.  If
+ * true, just return InvalidOid.
+ */
+Oid
+get_database_oid_case_insensitive(const char *dbname, bool missing_ok)
+{
+	Relation	relation;
+	SysScanDesc scandesc;
+	HeapTuple	tuple;
+	Oid oid = InvalidOid;
+
+	/*
+	 * SysCache has no ability to do a case-insensitive match, so we have
+	 * no means except scanning the whole catalog.
+	 */
+	relation = heap_open(DatabaseRelationId, AccessShareLock);
+
+	scandesc = systable_beginscan(relation, InvalidOid, false,
+  NULL, 0, NULL);
+	while (HeapTupleIsValid(tuple = systable_getnext(scandesc)))
+	{
+		Form_pg_database dbForm = (Form_pg_database) GETSTRUCT(tuple);
+
+		if (pg_strcasecmp(dbname, dbForm->datname.data) == 0)
+		{
+			oid = HeapTupleGetOid(tuple);
+			break;
+		}
+	}
+	systable_endscan(scandesc);
+	heap_close(relation, AccessShareLock);
+
+	if (!OidIsValid(oid) && !missing_ok)
+ 		ereport(ERROR,
+(errcode(ERRCODE_UNDEFINED_DATABASE),
+ errmsg("database \"%s\" does not exist",
+		dbname)));
+
+	return oid;
+}
 
 /*
  * get_database_name - given a database OID, look up the name
diff --git a/src/backend/libpq/hba.c b/src/backend/libpq/hba.c
index 84da823..2d3a059 100644
--- a/src/backend/libpq/hba.c
+++ b/src/backend/libpq/hba.c
@@ -60,9 +60,20 @@ typedef struct check_network_data
 	bool		result;			/* set to true if match */
 } check_network_data;
 
-
-#define token_is_keyword(t, k)	(!t->quoted && strcmp(t->string, k) == 0)
-#define token_matches(t, k)  (strcmp(t->string, k) == 0)
+typedef enum TokenType
+{
+	NORMAL,
+	GROUP_NAME,			/* this token had leading '+' */
+	FILE_INCLUSION,		/* this token had leading '@' */
+} TokenType;
+
+#define token_is_keyword(tk, kw)	\
+	((tk)->type != NORMAL || (tk)->case_sensitive ? false : \
+	 (pg_strcasecmp((tk)->string, (kw)) == 0))
+#define token_matches(t, k)		   \
+	((t)->type != NORMAL ? false :\
+	 ((t)->case_sensitive ? (strcmp((t)->string, (k)) == 0):	\
+	  (pg_strcasecmp((t)->string, (k)) == 0)))
 
 /*
  * A single string token lexed from the HBA config file, together with whether
@@ -71,7 +82,8 @@ typedef struct check_network_data
 typedef struct HbaToken
 {
 	char	   *string;
-	bool		quoted;
+	TokenType	type;
+	bool		case_sensitive;
 } HbaToken;
 
 

Re: [HACKERS] add modulo (%) operator to pgbench

2014-09-10 Thread Fabien COELHO


Hello Robert,


Sure, and I would have looked at that patch and complained that you
were implementing a modulo operator with different semantics than C.
And then we'd be right back where we are now.
[...]


Probably.

To be clear about my intent, which is a summary of what you already know: 
I would like to be able to generate a reasonable non uniform throttled 
workload with pgbench.


I do think such a feature is worth having for anybody who would like to do 
something realistic with pgbench, and that it is in the "general 
interest" of postgresql to have such features.


In particular, given the current state of abysmal performance under some 
trivial load with pg, I really think that it is really worth having a 
better tool, and I think that my effort with rate and progress helped to 
put these hidden problems into the light, and I do hope that they are 
going to be solved quickly.


Now if I submitted a big patch with all the necessary features in it, 
someone would ask to break it into pieces. So they are submitted one by 
one (progress, throttling, limit, modulo, ...).


Note I did not start with the non-uniform stuff, but Mitsumasa-san sent a 
gaussian distribution patch and I jumped onto the wagon to complement it 
with an exponential distribution patch. I knew when doing it that it was 
not enough, but as I said, "one piece at a time", given the effort required 
to get even a simple patch through.


What is still needed for the overall purpose is the ability to scatter the 
distribution. This is really:


 (1) a positive remainder modulo, which is the trivial patch under
 discussion

 (2) a fast hash function of good-enough quality for the purpose;
 also a rather small patch, not submitted yet.

 (3) maybe the '|' operator to do a TP*-like non-uniform load,
 which is really periodic so I do not like it.
 a trivial patch, not submitted yet.
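
For item (1), a minimal sketch of such a positive-remainder modulo (illustrative; the submitted patch may define the semantics slightly differently) is:

/*
 * Unlike C's %, the result is always in [0, abs(b)), which is what is
 * wanted when folding hashed account numbers back into a key range.
 */
static int64
pos_mod(int64 a, int64 b)
{
    int64   r = a % b;      /* C remainder takes the sign of a */

    if (r < 0)
        r += (b < 0 ? -b : b);
    return r;
}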

If you do not want one of these pieces (1 & 2), basically the interest of 
gaussian/exponential addition is much reduced, and this is just a half 
baked effort aborted because you did not want what was required to make it 
useful. Well, I can only disagree, but you are a committer and I'm not!


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: PENDING_LIST_CLEANUP_SIZE - maximum size of GIN pending list Re: [HACKERS] HEAD seems to generate larger WAL regarding GIN index

2014-09-10 Thread Etsuro Fujita

(2014/09/10 12:31), Fujii Masao wrote:

On Wed, Sep 10, 2014 at 12:15 PM, Etsuro Fujita
 wrote:

(2014/09/09 22:17), Fujii Masao wrote:

Attached is the updated version of the patch.



I took a quick review on the patch.  It looks good to me,



but one thing I'm
concerned about is

You wrote:

The attached patch introduces the GIN index storage parameter
"PENDING_LIST_CLEANUP_SIZE" which specifies the maximum size of
GIN pending list. If it's not set, work_mem is used as that maximum
size,
instead. So this patch doesn't break the existing application which
currently uses work_mem as the threshold of cleanup operation of
the pending list. It has only not to set PENDING_LIST_CLEANUP_SIZE.


As you mentioned, I think it's important to consider for the existing
applications, but I'm wondering if it would be a bit confusing users to have
two parameters,


Yep.


PENDING_LIST_CLEANUP_SIZE and work_mem, for this setting.
Wouldn't it be easy-to-use to have only one parameter,
PENDING_LIST_CLEANUP_SIZE?  How about setting PENDING_LIST_CLEANUP_SIZE to
work_mem as the default value when running the CREATE INDEX command?


That's an idea. But there might be some users who want to change
the cleanup size per session like they can do by setting work_mem,
and your idea would prevent them from doing that...


Why not use ALTER INDEX ... SET (PENDING_LIST_CLEANUP_SIZE= ...)?  Maybe 
I'm missing something, though.



So what about introduing pending_list_cleanup_size also as GUC?
That is, users can set the threshold by using either the reloption or
GUC.


Yeah, that's an idea.  So, I'd like to hear the opinions of others.

Thanks,

Best regards,
Etsuro Fujita


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Scaling shared buffer eviction

2014-09-10 Thread Mark Kirkwood

On 10/09/14 18:54, Amit Kapila wrote:

On Wed, Sep 10, 2014 at 5:46 AM, Mark Kirkwood
<mark.kirkw...@catalyst.net.nz> wrote:
 >
 > In terms of the effect of the patch - looks pretty similar to the
scale 2000 results for read-write, but read-only is a different and more
interesting story - unpatched 9.4 is noticeably impacted in the higher
client counts, whereas the patched version scales as well (or even
better perhaps) than in the scale 2000 case.

Yeah, that's what I was expecting: the benefit of this patch
will be greater at higher client counts when there is a large amount of
data and all of it can fit in RAM.

Many thanks for doing the performance test for patch.




No worries, this is looking like a patch we're going to apply to 9.4 to 
make the 60 core beast scale a bit better, so thanks very much for your 
work in this area.


If you would like more tests run at higher scales let me know (we have 
two of these machines at pre-production state currently so I can run 
benchmarks as reqd)!


Regards

Mark



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] replication commands and log_statements

2014-09-10 Thread Heikki Linnakangas

On 08/28/2014 11:38 AM, Fujii Masao wrote:

On Thu, Jun 19, 2014 at 5:29 PM, Ian Barwick  wrote:

- minor rewording for the description, mentioning that statements will
   still be logged as DEBUG1 even if parameter set to 'off' (might
   prevent reports of the kind "I set it to 'off', why am I still seeing
   log entries?").


 Causes each replication command to be logged in the server log.
 See  for more information about
 replication commands. The default value is off. When set
to
 off, commands will be logged at log level
DEBUG1.
 Only superusers can change this setting.



Yep, fixed. Attached is the updated version of the patch.


I don't think it's necessary to mention that the commands will still be 
logged at DEBUG1 level. We log all kinds of crap at the various DEBUG 
levels, and they're not supposed to be used in normal operation.



- I feel it would be more consistent to use the plural form
   for this parameter, i.e. "log_replication_commands", in line with
   "log_lock_waits", "log_temp_files", "log_disconnections" etc.


But log_statement is in the singular form. So I just used
log_replication_command. For the consistency, maybe we need to
change both parameters in the plural form? I don't have strong
opinion about this.


Yeah, we seem to be inconsistent. log_replication_commands would sound 
better to me in isolation, but then there is log_statement..


I'll mark this as Ready for Committer in the commitfest app (I assume 
you'll take care of committing this yourself when ready).


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] LIMIT for UPDATE and DELETE

2014-09-10 Thread Marko Tiikkaja

On 2014-09-10 04:25, Etsuro Fujita wrote:

(2014/09/09 18:57), Heikki Linnakangas wrote:

What's not clear to me is whether it make sense to do 1) without 2) ? Is
UPDATE .. LIMIT without support for an ORDER BY useful enough? And if we
apply this patch now, how much of it needs to be rewritten after 2) ? If
the answers are "yes" and "not much", then we should review this patch
now, and put 2) on the TODO list. Otherwise 2) should do done first.


My answers are "yes" but "completely rewritten".


Any particular reason for you to say that?  Because an UPDATE might have 
a RETURNING clause, all the updated tuples have to go through the 
ModifyTable node one at a time.  I don't see why we couldn't LIMIT there 
after implementing #2.



.marko


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] FD_SETSIZE on Linux?

2014-09-10 Thread Thom Brown
On 10 September 2014 00:21, Tom Lane  wrote:

> Thom Brown  writes:
> > I find this in pgbench.c:
>
> > #ifdef FD_SETSIZE
> > #define MAXCLIENTS  (FD_SETSIZE - 10)
> > #else
> > #define MAXCLIENTS  1024
> > #endif
>
> FD_SETSIZE is supposed to be defined, according to the POSIX spec:
>
> The <sys/select.h> header shall define the following symbolic constant,
> which shall have a value suitable for use in #if preprocessing
> directives:
>
> FD_SETSIZE
> Maximum number of file descriptors in an fd_set structure.
>
> It looks like Linux sets it to 1024.  On RHEL6, at least, I find this:
>
> $ grep -r FD_SETSIZE /usr/include
> /usr/include/linux/posix_types.h:#undef __FD_SETSIZE
> /usr/include/linux/posix_types.h:#define __FD_SETSIZE   1024
> ...
> /usr/include/sys/select.h:#define   FD_SETSIZE
> __FD_SETSIZE
> ...
>

Ah yes, I have the same on Debian:

/usr/include/linux/posix_types.h:#undef __FD_SETSIZE
/usr/include/linux/posix_types.h:#define __FD_SETSIZE 1024
...
/usr/include/x86_64-linux-gnu/sys/select.h:#define FD_SETSIZE __FD_SETSIZE
/usr/include/x86_64-linux-gnu/bits/typesizes.h:#define __FD_SETSIZE 1024
...

I didn't think to look beyond Postgres' code.


> > #ifdef WIN32
> > #define FD_SETSIZE 1024 /* set before winsock2.h is included */
> > #endif   /* ! WIN32 */
>
> Windows probably hasn't got sys/select.h at all, so it may not provide
> this symbol.
>
> Interestingly, it looks like POSIX also requires  to define
> FD_SETSIZE.  I wonder whether Windows has that header?  It'd definitely
> be better to get this symbol from the system than assume 1024 will work.
>

Okay, this now makes sense.  It just takes the system value and reduces it
by 10 to get the MAXCLIENTS value.

Thanks for the explanation.

Thom