Re: [HACKERS] Patch to add support of IF NOT EXISTS to others CREATE statements

2013-07-14 Thread Martijn van Oosterhout
On Sun, Jul 14, 2013 at 03:36:09AM -0300, Fabrízio de Royes Mello wrote:
  Next, changes in src/backend, starting with parser changes: the patch
  adds IF_P NOT EXISTS variants for various productions. For example:

snip

  I think opt_if_not_exists should be used for the others as well.
 
 
 I could not use the opt_if_not_exists because bison emits an error:
 
 /usr/bin/bison -d -o gram.c gram.y
 gram.y: conflicts: 10 shift/reduce
 gram.y: expected 0 shift/reduce conflicts
 make[3]: *** [gram.c] Error 1
 
 I really don't know how to solve this problem. I'm just doing adjustments like
 this:

This probably isn't solvable, which is why the coding is duplicated in many
existing places. The issue is that by using opt_if_not_exists you force
bison to decide much earlier which rule it is parsing. Bison only has one
token of lookahead, and if that's not enough you get errors.

BTW, bison can dump a large file describing all its states (the -v option);
you should be able to work out from that where the exact problem lies.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer




Re: [HACKERS] Removing Inner Joins

2013-07-14 Thread Atri Sharma


Sent from my iPad

On 10-Jul-2013, at 13:11, Hannu Krosing ha...@2ndquadrant.com wrote:

 On 07/10/2013 09:18 AM, Atri Sharma wrote:
 Can you please post an example of such a join removal? I mean a query before
 and after the removal. Thanks,
 Courtesy Robert Haas:
 
 SELECT foo.x, foo.y, foo.z FROM foo WHERE foo.x = bar.x
 
 Conditions:
 
 1) foo.x is not null.
 I guess that this is also not needed. you can just remove rows where
 
 foo.x is null
 
 That is, replace the join with foo.x is not null
 
 2) foo (x) is a foreign key referencing bar (x).
 
 We can ignore bar completely in this case i.e. avoid scanning bar.
 
 Regards,
 
 Atri
 
 
 --
 Regards,
 
 Atri
 l'apprenant
 
 
 
 
 

I discussed this with RhodiumToad and explored potential design approaches
for solving the above problem. My focus is to add support for foreign key
detection in the planner and allow the planner to make decisions based on it.

It wouldn't be too much of a cost to maintain the foreign key column and the
referenced table. The main issue, as pointed out by RhodiumToad, is primarily
that between the generation of the plan, which is made in accordance with the
presence of the foreign key, and the execution of the plan, we may get into an
inconsistent state when the corresponding row is deleted or constraints are
changed and the FK trigger has not yet run and detected those changes.

I was thinking of the row level locks which are taken by the FK trigger. Is
there any way we can modify the FK trigger to run forcibly? Or add a 'looked
at' bit, which is reset when a table/row is changed, and set by the FK trigger
when it runs on that table?

I am just thinking wild here; please let me know your thoughts, feedback and
ideas.

Regards,

Atri



Re: [HACKERS] Removing Inner Joins

2013-07-14 Thread Hannu Krosing
On 07/14/2013 06:10 PM, Atri Sharma wrote:

 Sent from my iPad

 On 10-Jul-2013, at 13:11, Hannu Krosing ha...@2ndquadrant.com wrote:

 On 07/10/2013 09:18 AM, Atri Sharma wrote:
 Can you please post an example of such a join removal? I mean a query 
 before
 and after the removal. Thanks,
 Courtesy Robert Haas:

 SELECT foo.x, foo.y, foo.z FROM foo WHERE foo.x = bar.x

 Conditions:

 1) foo.x is not null.
 I guess that this is also not needed. you can just remove rows where

 foo.x is null

 That is, replace the join with foo.x is not null
 2) foo (x) is a foreign key referencing bar (x).

 We can ignore bar completely in this case i.e. avoid scanning bar.

 Regards,

 Atri


 --
 Regards,

 Atri
 l'apprenant




 I discussed with RhodiumToad and was exploring potential design methods with 
 which we can solve the above problem. My focus is to add support for foreign 
 key detection in planner and allow planner to make decisions based on it.

 It wouldn't be too much of a cost to maintain the foreign key column and the 
 referenced table. The main issue, as pointed out by RHodiumToad is primarily 
 that, between the generation of the plan, which is made with accordance to 
 the foreign key presence, and the execution of the plan, we may get into an 
 inconsistent state when the corresponding row is deleted or constraints are 
 changed and fk trigger has not yet run and detected those changes.
Is this not all transactional and taken care of by MVCC?

That is, the problem can only happen for prepared plans, which need
to have invalidation in case of underlying DDL / schema changes anyway?

Or are you worried about the case where the FK constraint is delayed and
thus the plan can be invalid between the change and the running of the FK
trigger in the same transaction?

 I was thinking of row level locks,which are done by the fk trigger.Is there 
 any way we can modify the fk trigger to forcibly run? Or add an 'looked at' 
 bit, which is reset when a table/row is changed, and set by the fk trigger 
 when it runs on that table?

 I am just thinking wild here, please let me know your thoughts, feedback and 
 ideas.

 Regards,

 Atri


-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ





Re: [HACKERS] Removing Inner Joins

2013-07-14 Thread Atri Sharma


Sent from my iPad

On 14-Jul-2013, at 22:12, Hannu Krosing ha...@2ndquadrant.com wrote:

 On 07/14/2013 06:10 PM, Atri Sharma wrote:
 
 Sent from my iPad
 
 On 10-Jul-2013, at 13:11, Hannu Krosing ha...@2ndquadrant.com wrote:
 
 On 07/10/2013 09:18 AM, Atri Sharma wrote:
 Can you please post an example of such a join removal? I mean a query 
 before
 and after the removal. Thanks,
 Courtesy Robert Haas:
 
 SELECT foo.x, foo.y, foo.z FROM foo WHERE foo.x = bar.x
 
 Conditions:
 
 1) foo.x is not null.
 I guess that this is also not needed. you can just remove rows where
 
 foo.x is null
 
 That is, replace the join with foo.x is not null
 2) foo (x) is a foreign key referencing bar (x).
 
 We can ignore bar completely in this case i.e. avoid scanning bar.
 
 Regards,
 
 Atri
 
 
 --
 Regards,
 
 Atri
 l'apprenant
 I discussed with RhodiumToad and was exploring potential design methods with 
 which we can solve the above problem. My focus is to add support for foreign 
 key detection in planner and allow planner to make decisions based on it.
 
 It wouldn't be too much of a cost to maintain the foreign key column and the 
 referenced table. The main issue, as pointed out by RHodiumToad is primarily 
 that, between the generation of the plan, which is made with accordance to 
 the foreign key presence, and the execution of the plan, we may get into an 
 inconsistent state when the corresponding row is deleted or constraints are 
 changed and fk trigger has not yet run and detected those changes.
 Is this not all transactional and taken care of by MVCC ?
 
 That is, the problem can only happen for prepared plans, which need
 to have invalidation in case of underlaying DDL / schema changes anyway ?
 
 Or are you worried about the case where the FK constraint is delayed and
 thus the plan can be invalid between the change and running of FK trigger
 in the same transaction ?

Yes, that is precisely what I am concerned about. Thanks for wording it so
nicely!

Regards,

Atri
 




Re: [HACKERS] pgsql: Optimize pglz compressor for small inputs.

2013-07-14 Thread Stephen Frost
Heikki,

* Heikki Linnakangas (heikki.linnakan...@iki.fi) wrote:
 This patch alleviates that in two ways. First, instead of storing pointers
 in the hash table, store 16-bit indexes into the hist_entries array. That
 slashes the size of the hash table to 1/2 or 1/4 of the original, depending
 on the pointer width. Secondly, adjust the size of the hash table based on
 input size. For very small inputs, you don't need a large hash table to
 avoid collisions.

  The coverity scanner has a bit of an issue with this patch which, at
  least on first blush, looks like a valid concern.
  
  While the change in pg_lzcompress.c:pglz_find_match() to loop on:
  
  while (hent != INVALID_ENTRY_PTR)
  {
  const char *ip = input;
  const char *hp = hent->pos;

  looks good, and INVALID_ENTRY_PTR is the address of the first entry in
  the array (and can't be NULL), towards the end of the loop we do:

  hent = hent->next;
  if (hent)
  ...

  Should we really be checking for 'hent != INVALID_ENTRY_PTR' here?  If
  not, and hent really can end up as NULL, then we're going to segfault
  on the next loop due to the unchecked 'hent->pos' early in the loop.
  If hent can never be NULL, then we probably don't need this check at
  all.
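
  To make the shape of that concern concrete, here is a stripped-down,
  standalone sketch of the loop -- stand-in types, not the actual
  pg_lzcompress.c definitions -- with the trailing check written against
  INVALID_ENTRY_PTR the way the loop condition already is:

typedef struct HistEntry
{
    struct HistEntry *next;
    const char       *pos;
} HistEntry;

static HistEntry invalid_entry;
#define INVALID_ENTRY_PTR (&invalid_entry)

/* Walk a history chain the way the quoted loop does, but test the chain
 * terminator consistently instead of mixing it with a NULL test. */
static int
walk_chain(HistEntry *hent, const char *input)
{
    int visited = 0;

    while (hent != INVALID_ENTRY_PTR)
    {
        const char *hp = hent->pos;     /* would crash if hent were NULL */

        (void) hp;
        (void) input;
        visited++;

        hent = hent->next;
        if (hent != INVALID_ENTRY_PTR)  /* rather than: if (hent) */
        {
            /* ... early-exit heuristics would go here ... */
        }
    }
    return visited;
}

int
main(void)
{
    HistEntry e2 = {INVALID_ENTRY_PTR, "b"};
    HistEntry e1 = {&e2, "a"};

    return (walk_chain(&e1, "input") == 2) ? 0 : 1;
}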

Thanks,

Stephen




Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)

2013-07-14 Thread Greg Smith

On 7/13/13 12:13 PM, Fabien COELHO wrote:

My 0.02€: if it means adding complexity to the pgbench code, I think
that it is not worth it. The point of pgbench is to look at a steady
state, not to end in the most graceful possible way as far as measures
are concerned.


That's how some people use pgbench.  I'm just as likely to use it to 
characterize system latency.  If there's a source of latency that's 
specific to the pgbench code, I want that out of there even if it's hard.


But we don't have to argue about that because it isn't.  The attached 
new patch seems to fix the latency spikes at the end, with -2 lines of 
new code!  With that resolved I did a final pass across the rate limit 
code too, attached as a v14 and ready for a committer.  I don't really 
care what order these two changes are committed, there's no hard 
dependency, but I would like to see them both go in eventually.


No functional code was changed from your v13 except for tweaking the 
output.  The main thing I did was expand/edit comments and rename a few 
variables to try and make this easier to read.  If you have any 
objections to my cosmetic changes feel free to post an update.  I've put 
a good bit of time into trying to simplify this further, thinking it 
can't really be this hard.  But this seems to be the minimum complexity 
that still works given the mess of the pgbench state machine.  Every 
change I try now breaks something.


To wrap up the test results motivating my little pgbench-delay-finish 
patch, the throttled cases that were always giving 100ms of latency 
clustered at the end here now look like this:


average rate limit lag: 0.181 ms (max 53.108 ms)
tps = 10088.727398 (including connections establishing)
tps = 10105.775864 (excluding connections establishing)

There are still some of these cases where latency spikes, but they're 
not as big and they're randomly distributed throughout the run.  The 
problem I had with the ones at the end is how they tended to happen a 
few times in a row.  I kept seeing multiple of these ~50ms lulls adding 
up to a huge one, because the source of the lag kept triggering at every 
connection close.


pgbench was already cleaning up all of its connections at the end, after 
all the transactions were finished.  It looks safe to me to just rely on 
that for calling PQfinish in the normal case.  And calls to client_done 
already label themselves ok or abort, the code just didn't do anything 
with that state before.  I tried adding some more complicated state 
tracking, but that adds complexity while doing the exact same thing as 
the simple implementation I did.


The only part of your code change I reverted was altering the latency 
log transaction timestamps to read like 1373821907.65702 instead of 
1373821907 65702.  Both formats were considered when I added that 
feature, and I completely understand why you'd like to change it.  One 
problem is that doing so introduces a new class of float parsing and 
rounding issues to consumers of that data.  I'd rather not revisit that 
without a better reason to break the output format.  Parsing tools like 
my pgbench-tools already struggle trying to support multiple versions of 
pgbench, and I don't think there's enough benefit to the float format to 
bother breaking them today.
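
For what it's worth, the rounding concern is easy to see in a few lines of 
C; the timestamp below is made up, but it has the same shape as the combined 
epoch.microseconds format:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* hypothetical latency-log timestamp in the combined format */
    const char *s = "1373821907.065702";
    double      t = strtod(s, NULL);
    long        sec = (long) t;
    long        usec = (long) ((t - sec) * 1000000.0 + 0.5);

    /*
     * Epoch seconds plus microseconds needs about 16 significant digits,
     * which is right at the edge of what a double holds, so every consumer
     * of the log ends up worrying about float parsing and re-rounding.
     */
    printf("%ld %06ld\n", sec, usec);
    return 0;
}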


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
new file mode 100644
index 08095a9..8dc81e5
*** a/contrib/pgbench/pgbench.c
--- b/contrib/pgbench/pgbench.c
*** preparedStatementName(char *buffer, int 
*** 862,875 
  static bool
  clientDone(CState *st, bool ok)
  {
!   /*
!* When the connection finishes normally, don't call PQfinish yet.
!* PQfinish can cause significant delays in other clients that are
!* still running.  Rather than finishing all of them here, in the
!* normal case clients are instead closed in bulk by disconnect_all,
!* after they have all stopped.
!*/
!   if ((st->con != NULL) && ok)
{
PQfinish(st->con);
st->con = NULL;
--- 862,870 
  static bool
  clientDone(CState *st, bool ok)
  {
!   (void) ok;  /* unused */
! 
!   if (st->con != NULL)
{
PQfinish(st->con);
st->con = NULL;
diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 80203d6..da88bd7 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -137,6 +137,12 @@ int		unlogged_tables = 0;
 double sample_rate = 0.0;
 
 /*
+ * When threads are throttled to a given rate limit, this is the target delay
+ * to reach that rate in usec.  0 is the default and means no throttling.
+ */
+int64  throttle_delay = 0;
+
+/*
  * tablespace selection
  */
 char  *tablespace = NULL;

Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-14 Thread Greg Smith

On 6/16/13 10:27 AM, Heikki Linnakangas wrote:

Yeah, the checkpoint scheduling logic doesn't take into account the
heavy WAL activity caused by full page images...
Rationalizing a bit, I could even argue to myself that it's a *good*
thing. At the beginning of a checkpoint, the OS write cache should be
relatively empty, as the checkpointer hasn't done any writes yet. So it
might make sense to write a burst of pages at the beginning, to
partially fill the write cache first, before starting to throttle. But
this is just handwaving - I have no idea what the effect is in real life.


That's exactly right.  When a checkpoint finishes the OS write cache is 
clean.  That means all of the full-page writes aren't even hitting disk 
in many cases.  They just pile up in the OS dirty memory, often sitting 
there all the way until when the next checkpoint fsyncs start.  That's 
why I never wandered down the road of changing FPW behavior.  I have 
never seen a benchmark workload hit a write bottleneck until long after 
the big burst of FPW pages is over.


I could easily believe that there are low-memory systems where the FPW 
write pressure becomes a problem earlier.  And slim VMs make sense as 
the place this behavior is being seen at.


I'm a big fan of instrumenting the code around a performance change 
before touching anything, as a companion patch that might make sense to 
commit on its own.  In the case of a change to FPW spacing, I'd want to 
see some diagnostic output in something like pg_stat_bgwriter that 
tracks how many FPW pages are being modified.  A 
pgstat_bgwriter.full_page_writes counter would be perfect here, and then 
graph that data over time as the benchmark runs.



Another thought is that rather than trying to compensate for that effect
in the checkpoint scheduler, could we avoid the sudden rush of full-page
images in the first place? The current rule for when to write a full
page image is conservative: you don't actually need to write a full page
image when you modify a buffer that's sitting in the buffer cache, if
that buffer hasn't been flushed to disk by the checkpointer yet, because
the checkpointer will write and fsync it later. I'm not sure how much it
would smoothen WAL write I/O, but it would be interesting to try.


There I also think the right way to proceed is instrumenting that area 
first.



A long time ago, Itagaki wrote a patch to sort the checkpoint writes:
www.postgresql.org/message-id/flat/20070614153758.6a62.itagaki.takah...@oss.ntt.co.jp.
He posted very promising performance numbers, but it was dropped because
Tom couldn't reproduce the numbers, and because sorting requires
allocating a large array, which has the risk of running out of memory,
which would be bad when you're trying to checkpoint.


I updated and re-reviewed that in 2011: 
http://www.postgresql.org/message-id/4d31ae64.3000...@2ndquadrant.com 
and commented on why I think the improvement was difficult to reproduce 
back then.  The improvement didn't follow for me either.  It would take 
a really amazing bit of data to get me to believe write sorting code is 
worthwhile after that.  On large systems capable of dirtying enough 
blocks to cause a problem, the operating system and RAID controllers are 
already sorting blocks.  And *that* sorting is also considering 
concurrent read requests, which are a lot more important to an efficient 
schedule than anything the checkpoint process knows about.  The database 
doesn't have nearly enough information yet to compete against OS level 
sorting.




Bad point of my patch is longer checkpoint. Checkpoint time was
increased about 10% - 20%. But it can work correctry on schedule-time in
checkpoint_timeout. Please see checkpoint result (http://goo.gl/NsbC6).


For a fair comparison, you should increase the
checkpoint_completion_target of the unpatched test, so that the
checkpoints run for roughly the same amount of time with and without the
patch. Otherwise the benefit you're seeing could be just because of a
more lazy checkpoint.


Heikki has nailed the problem with the submitted dbt-2 results here.  If 
you spread checkpoints out more, you cannot fairly compare the resulting 
TPS or latency numbers anymore.


Simple example:  20 minute long test.  Server A does a checkpoint every 
5 minutes.  Server B has modified parameters or server code such that 
checkpoints happen every 6 minutes.  If you run both to completion, A 
will have hit 4 checkpoints that flush the buffer cache, B only 3.  Of 
course B will seem faster.  It didn't do as much work.


pgbench_tools measures the number of checkpoints during the test, as 
well as the buffer count statistics.  If those numbers are very 
different between two tests, I have to throw them out as unfair.  A lot 
of things that seem promising turn out to have this sort of problem.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com



Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)

2013-07-14 Thread Fabien COELHO


Hello Greg,

But we don't have to argue about that because it isn't.  The attached new 
patch seems to fix the latency spikes at the end, with -2 lines of new code!


Hmmm... that looks like not too much complexity:-)

With that resolved I did a final pass across the rate limit code too, 
attached as a v14 and ready for a committer.


You attached my v13. Could you send your v14?

--
Fabien.




Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-14 Thread Greg Smith

On 6/27/13 11:08 AM, Robert Haas wrote:

I'm pretty sure Greg Smith tried it the fixed-sleep thing before and
it didn't work that well.


That's correct, I spent about a year whipping that particular horse and 
submitted improvements on it to the community. 
http://www.postgresql.org/message-id/4d4f9a3d.5070...@2ndquadrant.com 
and its updates downthread are good ones to compare this current work 
against.


The important thing to realize about just delaying fsync calls is that 
it *cannot* increase TPS throughput.  Not possible in theory, obviously 
doesn't happen in practice.  The most efficient way to write things out 
is to delay those writes as long as possible.  The longer you postpone a 
write, the more elevator sorting and write combining you get out of the 
OS.  This is why operating systems like Linux come tuned for such 
delayed writes in the first place.  Throughput and latency are linked; 
any patch that aims to decrease latency will probably slow throughput.


Accordingly, the current behavior--no delay--is already the best 
possible throughput.  If you apply a write timing change and it seems to 
increase TPS, that's almost certainly because it executed less 
checkpoint writes.  It's not a fair comparison.  You have to adjust any 
delaying to still hit the same end point on the checkpoint schedule. 
That's what my later submissions did, and under that sort of controlled 
condition most of the improvements went away.


Now, I still do really believe that better spacing of fsync calls helps 
latency in the real world.  Far as I know the server that I developed 
that patch for originally in 2010 is still running with that change. 
The result is not a throughput change though; there is a throughput drop 
with a latency improvement.  That is the unbreakable trade-off in this 
area if all you touch is scheduling.


The reason why I was ignoring this discussion and working on pgbench 
throttling until now is that you need to measure latency at a constant 
throughput to advance here on this topic, and that's exactly what the 
new pgbench feature enables.  If we can take the current checkpoint 
scheduler and an altered one, run both at exactly the same rate, and one 
gives lower latency, now we're onto something.  It's possible to do that 
with DBT-2 as well, but I wanted something really simple that people 
could replicate results with in pgbench.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)

2013-07-14 Thread Greg Smith

On 7/14/13 2:48 PM, Fabien COELHO wrote:

You attached my v13. Could you send your v14?


Correct patch (and the little one from me again) attached this time.

--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
new file mode 100644
index 08095a9..6564057
*** a/contrib/pgbench/pgbench.c
--- b/contrib/pgbench/pgbench.c
*** int unlogged_tables = 0;
*** 137,142 
--- 137,148 
  doublesample_rate = 0.0;
  
  /*
+  * When threads are throttled to a given rate limit, this is the target delay
+  * to reach that rate in usec.  0 is the default and means no throttling.
+  */
+ int64 throttle_delay = 0;
+ 
+ /*
   * tablespace selection
   */
  char *tablespace = NULL;
*** typedef struct
*** 200,210 
--- 206,218 
int listen; /* 0 indicates that an 
async query has been
 * sent */
int sleeping;   /* 1 indicates that the 
client is napping */
+   bool    throttling; /* whether nap is for throttling */
int64   until;  /* napping until (usec) */
Variable   *variables;  /* array of variable definitions */
int nvariables;
instr_time  txn_begin;  /* used for measuring 
transaction latencies */
instr_time  stmt_begin; /* used for measuring statement 
latencies */
+   bool    is_throttled;   /* whether transaction should be 
throttled */
int use_file;   /* index in sql_files 
for this client */
boolprepared[MAX_FILES];
  } CState;
*** typedef struct
*** 222,227 
--- 230,238 
instr_time *exec_elapsed;   /* time spent executing cmds (per 
Command) */
int*exec_count; /* number of cmd executions 
(per Command) */
unsigned short random_state[3]; /* separate randomness for each 
thread */
+   int64   throttle_trigger;   /* previous/next throttling (us) */
+   int64   throttle_lag;   /* total transaction lag behind 
throttling */
+   int64   throttle_lag_max;   /* max transaction lag */
  } TState;
  
  #define INVALID_THREAD((pthread_t) 0)
*** typedef struct
*** 230,235 
--- 241,248 
  {
instr_time  conn_time;
int xacts;
+   int64   throttle_lag;
+   int64   throttle_lag_max;
  } TResult;
  
  /*
*** usage(void)
*** 353,358 
--- 366,372 
 -n, --no-vacuum  do not run VACUUM before tests\n
 -N, --skip-some-updates  skip updates of pgbench_tellers 
and pgbench_branches\n
 -r, --report-latencies   report average latency per 
command\n
+-R, --rate SPEC  target rate in transactions per 
second\n
 -s, --scale=NUM  report this scale factor in 
output\n
 -S, --select-onlyperform SELECT-only 
transactions\n
 -t, --transactions   number of transactions each 
client runs 
*** doCustom(TState *thread, CState *st, ins
*** 900,916 
  {
PGresult   *res;
Command   **commands;
  
  top:
commands = sql_files[st->use_file];
  
if (st->sleeping)
{   /* are we 
sleeping? */
instr_time  now;
  
INSTR_TIME_SET_CURRENT(now);
!   if (st->until <= INSTR_TIME_GET_MICROSEC(now))
st->sleeping = 0;   /* Done sleeping, go ahead with 
next command */
else
return true;/* Still sleeping, nothing to 
do here */
}
--- 914,973 
  {
PGresult   *res;
Command   **commands;
+   bool    trans_needs_throttle = false;
  
  top:
commands = sql_files[st->use_file];
  
+   /*
+* Handle throttling once per transaction by sleeping.  It is simpler
+* to do this here rather than at the end, because so much complicated
+* logic happens below when statements finish.
+*/
+   if (throttle_delay && ! st->is_throttled)
+   {
+   /*
+* Use inverse transform sampling to randomly generate a delay, 
such
+* that the series of delays will approximate a Poisson 
distribution
+* centered on the throttle_delay time.
+*
+* If transactions are too slow or a given wait is shorter than
+* a 

Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-14 Thread Greg Smith

On 7/3/13 9:39 AM, Andres Freund wrote:

I wonder how much of this could be gained by doing a
sync_file_range(SYNC_FILE_RANGE_WRITE) (or similar) either while doing
the original checkpoint-pass through the buffers or when fsyncing the
files.


The fsync calls decompose into the queued set of block writes.  If 
they all need to go out eventually to finish a checkpoint, the most 
efficient way from a throughput perspective is to dump them all at once.


I'm not sure sync_file_range targeting checkpoint writes will turn out 
any differently than block sorting.  Let's say the database tries to get 
involved in forcing a particular write order that way.  Right now it's 
going to be making that ordering decision without the benefit of also 
knowing what blocks are being read.  That makes it hard to do better 
than the OS, which knows a different--and potentially more useful in a 
read-heavy environment--set of information about all the pending I/O. 
And it would be very expensive to make all the backends start sharing 
information about what they read to ever pull that logic into the 
database.  It's really easy to wander down the path where you assume you 
must know more than the OS does, which leads to things like direct I/O. 
 I am skeptical of that path in general.  I really don't want Postgres 
to be competing with the innovation rate in Linux kernel I/O if we can 
ride it instead.


One idea I was thinking about that overlaps with a sync_file_range 
refactoring is simply tracking how many blocks have been written to each 
relation.  If there were a rule like "fsync any relation that's gotten 
more than 100 8K writes", we'd never build up the sort of backlog that 
causes the worst latency issues.  You really need to start tracking the 
file range there, just to fairly account for multiple writes to the same 
block.  One of the reasons I don't mind all the work I'm planning to put 
into block write statistics is that I think that will make it easier to 
build this sort of facility too.  The original page write and the fsync 
call that eventually flushes it out are very disconnected right now, and 
file range data seems the right missing piece to connect them well.
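
As a hypothetical sketch of the bookkeeping that rule implies -- the names 
and the threshold are illustrative only, and a real version would track file 
ranges rather than the bare counter shown here, as noted above:

#include <stdbool.h>
#include <stdint.h>

#define EARLY_FSYNC_THRESHOLD 100      /* the "100 8K writes" rule above */

typedef struct PendingRelWrites
{
    uint32_t    relfilenode;           /* which relation was written */
    int         pending_writes;        /* 8K block writes since last fsync */
} PendingRelWrites;

/* Count one buffer write; report whether the relation is now due for an
 * early fsync under the hypothetical rule. */
bool
note_block_write(PendingRelWrites *rel)
{
    rel->pending_writes++;
    return rel->pending_writes >= EARLY_FSYNC_THRESHOLD;
}

/* After the caller has fsync'd the relation's file(s), reset the counter. */
void
note_fsync_done(PendingRelWrites *rel)
{
    rel->pending_writes = 0;
}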


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Materialized views WIP patch

2013-07-14 Thread Noah Misch
While doing some post-commit review of the 9.3 materialized view feature, I
noticed a few loose ends:

On Thu, Jan 24, 2013 at 01:09:28PM -0500, Noah Misch wrote:
 Note that [...] ALTER TABLE ... RENAME CONSTRAINT [is]
 currently supported for MVs by ALTER TABLE but not by ALTER MATERIALIZED VIEW.
 
 There's no documented support for table constraints on MVs, but UNIQUE
 constraints are permitted:
 
 [local] test=# alter materialized view mymv add unique (c);
 ALTER MATERIALIZED VIEW
 [local] test=# alter materialized view mymv add check (c > 0);
 ERROR:  "mymv" is not a table
 [local] test=# alter materialized view mymv add primary key (c);
 ERROR:  "mymv" is not a table or foreign table

The above points still apply.

Also, could you explain the use of RelationCacheInvalidateEntry() in
ExecRefreshMatView()?  CacheInvalidateRelcacheByRelid() followed by
CommandCounterIncrement() is the typical pattern; this is novel.  I suspect,
though, neither is necessary now that the relcache does not maintain populated
status based on a fork size reading.

Thanks,
nm

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com




Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-14 Thread james

On 14/07/2013 20:13, Greg Smith wrote:
The most efficient way to write things out is to delay those writes as 
long as possible.


That doesn't smell right to me.  It might be that delaying allows more 
combining and allows the kernel to see more at once and optimise it, but 
I think the counter-argument is that it is an efficiency loss to have 
either CPU or disk idle waiting on the other.  It cannot make sense from 
a throughput point of view to have disks doing nothing and then become 
overloaded so they are a bottleneck (primarily seeking) and the CPU does 
nothing.


Now I have NOT measured behaviour but I'd observe that we see disks that 
can stream 100MB/s but do only 5% of that if they are doing random IO.  
Some random seeks during sync can't be helped, but if they are done when 
we aren't waiting for sync completion then they are in effect free.  The 
flip side is that we can't really know whether they will get merged with 
adjacent writes later, so it's hard to schedule them early.  But we can 
observe that if we have a bunch of writes to adjacent data then a seek 
to do the write is effectively amortised across them.


So it occurs to me that perhaps we can watch for patterns where we have 
groups of adjacent writes that might stream, and when they form we might 
schedule them to be pushed out early (if not immediately), ideally out 
as far as the drive (but not flushed from its cache) and without forcing 
all other data to be flushed too.  And perhaps we should always look to 
be getting drives dedicated to dbms to do something, even if it turns 
out to have been redundant in the end.


That's not necessarily easy on Linux without using direct unbuffered 
IO, but to me that is Linux's problem.  For a start it's not the only 
target system, and having 'we need' feedback from db and mail system 
groups to the NT kernel devs hasn't hurt, and it never hurt Solaris to 
hear what Oracle and Sybase devs felt they needed either.






Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)

2013-07-14 Thread Fabien COELHO


Hello Greg,


Correct patch (and the little one from me again) attached this time.


Please find an updated v15 with only comment changes:

* The new comment about is_throttled was inverted wrt the meaning of the 
variable, at least to my understanding.


* ISTM that the impact of the chosen 1000 should appear somewhere (see the 
standalone sketch after this list).

* The trans_need_throttle comment was slightly wrong: the first 
transaction is also throttled, when initially entering doCustom.
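
To make the impact of the chosen 1000 concrete, here is a tiny standalone 
illustration of the sampling (the throttle_delay value is just an example): 
u drawn uniformly from {1/1000 .. 1} gives a -log(u) multiplier between 
about 6.9 and 0, with a mean slightly under 1, so the average delay lands 
near throttle_delay while the worst case is capped at roughly 6.9 times it.

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const long throttle_delay = 100000;    /* example: 100 ms target mean */
    double     sum = 0.0;
    long       max_wait = 0;

    srandom(42);
    for (long i = 0; i < 1000000; i++)
    {
        /* like getrand(thread, 1, 1000) / 1000.0 in the patch */
        double u = (random() % 1000 + 1) / 1000.0;
        long   wait = (long) (throttle_delay * -log(u));

        sum += wait;
        if (wait > max_wait)
            max_wait = wait;
    }
    /* the mean comes out slightly below throttle_delay; the maximum is
     * capped near log(1000) ~= 6.9 times throttle_delay */
    printf("mean %.0f us, max %ld us\n", sum / 1000000.0, max_wait);
    return 0;
}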



About "123456 12345" vs "123456.012345": my data parser is usually gnuplot 
or my eyes, and in both cases the latter option is better :-)



About the little patch: parameter "ok" should be renamed to something 
meaningful (say "do_finish"?). Also, it seems that when the timer is exceeded 
in doCustom it is called with true, but maybe you intended that it should 
be called with false? Moreover, under normal circumstances (throttling 
significantly below the maximum rate) PQfinish will be called directly by 
threadRun while interrupting sleeping threads. Also, it is important to 
remove the connection because it serves as a marker to know whether a 
client must run or not. So basically I do not understand how it works. 
Note that this does not mean that it does not work, it just means that I do 
not really understand :-)


--
Fabien.

diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 8dc81e5..4b379ab 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -137,6 +137,12 @@ int			unlogged_tables = 0;
 double		sample_rate = 0.0;
 
 /*
+ * When threads are throttled to a given rate limit, this is the target delay
+ * to reach that rate in usec.  0 is the default and means no throttling.
+ */
+int64		throttle_delay = 0;
+
+/*
  * tablespace selection
  */
 char	   *tablespace = NULL;
@@ -200,11 +206,13 @@ typedef struct
 	int			listen;			/* 0 indicates that an async query has been
  * sent */
 	int			sleeping;		/* 1 indicates that the client is napping */
+	bool		throttling;		/* whether nap is for throttling */
 	int64		until;			/* napping until (usec) */
 	Variable   *variables;		/* array of variable definitions */
 	int			nvariables;
 	instr_time	txn_begin;		/* used for measuring transaction latencies */
 	instr_time	stmt_begin;		/* used for measuring statement latencies */
+	bool		is_throttled;	/* whether transaction throttling is done */
 	int			use_file;		/* index in sql_files for this client */
 	bool		prepared[MAX_FILES];
 } CState;
@@ -222,6 +230,9 @@ typedef struct
 	instr_time *exec_elapsed;	/* time spent executing cmds (per Command) */
 	int		   *exec_count;		/* number of cmd executions (per Command) */
 	unsigned short random_state[3];		/* separate randomness for each thread */
+	int64   throttle_trigger; 	/* previous/next throttling (us) */
+	int64   throttle_lag; 		/* total transaction lag behind throttling */
+	int64   throttle_lag_max; 	/* max transaction lag */
 } TState;
 
 #define INVALID_THREAD		((pthread_t) 0)
@@ -230,6 +241,8 @@ typedef struct
 {
 	instr_time	conn_time;
 	int			xacts;
+	int64   throttle_lag;
+	int64   throttle_lag_max;
 } TResult;
 
 /*
@@ -353,6 +366,7 @@ usage(void)
 		 -n, --no-vacuum  do not run VACUUM before tests\n
 		 -N, --skip-some-updates  skip updates of pgbench_tellers and pgbench_branches\n
 		 -r, --report-latencies   report average latency per command\n
+		 -R, --rate SPEC  target rate in transactions per second\n
 		 -s, --scale=NUM  report this scale factor in output\n
 		 -S, --select-onlyperform SELECT-only transactions\n
 		 -t, --transactions   number of transactions each client runs 
@@ -895,17 +909,62 @@ doCustom(TState *thread, CState *st, instr_time *conn_time, FILE *logfile, AggVa
 {
 	PGresult   *res;
 	Command   **commands;
+	bool		trans_needs_throttle = false;
 
 top:
	commands = sql_files[st->use_file];
 
+	/*
+	 * Handle throttling once per transaction by sleeping.  It is simpler
+	 * to do this here rather than at the end, because so much complicated
+	 * logic happens below when statements finish.
+	 */
+	if (throttle_delay && ! st->is_throttled)
+	{
+		/*
+		 * Use inverse transform sampling to randomly generate a delay, such
+		 * that the series of delays will approximate a Poisson distribution
+		 * centered on the throttle_delay time.
+		 *
+		 * 1000 implies a 6.9 (-log(1/1000)) to 0.0 (log 1.0) delay multiplier.
+		 *
+		 * If transactions are too slow or a given wait is shorter than
+		 * a transaction, the next transaction will start right away.
+		 */
+		int64 wait = (int64)
+			throttle_delay * -log(getrand(thread, 1, 1000)/1000.0);
+
+		thread->throttle_trigger += wait;
+
+		st->until = thread->throttle_trigger;
+		st->sleeping = 1;
+		st->throttling = true;
+		st->is_throttled = true;
+		if (debug)
+			fprintf(stderr, "client %d throttling "INT64_FORMAT" us\n",
+					st->id, wait);
+	}
+
	if (st->sleeping)
 	{			/* are we 

Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-14 Thread Greg Smith
On 7/11/13 8:29 AM, KONDO Mitsumasa wrote:
 I use linear combination method for considering about total checkpoint 
 schedule
 which are write phase and fsync phase. V3 patch was considered about only 
 fsync
 phase, V4 patch was considered about write phase and fsync phase, and v5 patch
 was considered about only fsync phase.

Your v5 now looks like my Self-tuning checkpoint sync spread series:
https://commitfest.postgresql.org/action/patch_view?id=514 which I did
after deciding write phase delays didn't help.  It looks to me like
some, maybe all, of your gain is coming from how any added delays spread
out the checkpoints.  The self-tuning part I aimed at was trying to
stay on exactly the same checkpoint end time even with the delays in
place.  I got that part to work, but the performance gain went away once
the schedule was a fair comparison.  You are trying to solve a very hard
problem.

How long are you running your dbt-2 tests for?  I didn't see that listed
anywhere.

 ** Average checkpoint duration (sec) (Not include during loading time)
   | write_duration | sync_duration | total
 fsync v3-0.7 | 296.6  | 251.8898  | 548.48 | OK
 fsync v3-0.9 | 292.086| 276.4525  | 568.53 | OK
 fsync v3-0.7_disabled| 303.5706   | 155.6116  | 459.18 | OK
 fsync v4-0.7 | 273.8338   | 355.6224  | 629.45 | OK
 fsync v4-0.9 | 329.0522   | 231.77| 560.82 | OK

I graphed the total times against the resulting NOTPM values and
attached that.  I expect transaction rate to increase along with the
time between checkpoints, and that's what I see here.  The fsync v4-0.7
result is worse than the rest for some reason, but all the rest line up
nicely.

Notice how fsync v3-0.7_disabled has the lowest total time between
checkpoints, at 459.18.  That is why it has the most I/O and therefore
runs more slowly than the rest.  If you take your fsync v3-0.7_disabled
and increase checkpoint_segments and/or checkpoint_timeout until that
test is averaging about 550 seconds between checkpoints, NOTPM should
also increase.  That's interesting to know, but you don't need any
change to Postgres for that.  That's what always happens when you have
less checkpoints per run.

If you get a checkpoint time table like this where the total duration is
very close--within +/-20 seconds is the sort of noise I would expect
there--at that point I would say you have all your patches on the same
checkpoint schedule.  And then you can compare the NOTPM numbers
usefully.  When the checkpoint times are in a large range like 459.18 to
629.45 in this table, as my graph shows the associated NOTPM numbers are
going to be based on that time.

I would recommend taking a snapshot of pg_stat_bgwriter before and after
the test runs, and then showing the difference between all of those
numbers too.  If the test runs for a while--say 30 minutes--the total
number of checkpoints should be very close too.

 * Test Server
Server: HP Proliant DL360 G7
CPU:Xeon E5640 2.66GHz (1P/4C)
Memory: 18GB(PC3-10600R-9)
Disk:   146GB(15k)*4 RAID1+0
RAID controller: P410i/256MB
(Add) Set off energy efficient function in BIOS and OS.

Excellent, here I have a DL160 G6 with 2 processors, 72GB of RAM, and
that same P410 controller + 4 disks.  I've been meaning to get DBT-2
running on there usefully, your research gives me a reason to do that.

You seem to be in a rush due to the commitfest schedule.  I have some
bad news for you there.  You're not going to see a change here committed
in this CF based on where it's at, so you might as well think about the
best longer term plan.  I would be shocked if anything came out of this
in less than 3 months really.  That's the shortest amount of time I've
ever done something useful in this area.  Each useful benchmarking run
takes me about 3 days of computer time, it's not a very fast development
cycle.

Even if all of your results were great, we'd need to get someone to
duplicate them on another server, and we'd need to make sure they didn't
make other workloads worse.  DBT-2 is very useful, but no one is going
to get a major change to the write logic in the database committed based
on one benchmark.  Past changes like this have used both DBT-2 and a
large number of pgbench tests to get enough evidence of improvement to
commit.  I can help with that part when you get to something I haven't
tried already.  I am very interested in improving this area; it just
takes a lot of work to do it.

-- 
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
attachment: NOTPM-Checkpoints.png


Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-14 Thread Greg Smith

On 7/14/13 5:28 PM, james wrote:

Some random seeks during sync can't be helped, but if they are done when
we aren't waiting for sync completion then they are in effect free.


That happens sometimes, but if you measure you'll find this doesn't 
actually occur usefully in the situation everyone dislikes.  In a write 
heavy environment where the database doesn't fit in RAM, backends and/or 
the background writer are constantly writing data out to the OS.  WAL is 
going out constantly as well, and in many cases that's competing for the 
disks too.  The most popular blocks in the database get high usage 
counts and they never leave shared_buffers except at checkpoint time. 
That's easy to prove to yourself with pg_buffercache.


And once the write cache fills, every I/O operation is now competing. 
There is nothing happening for free.  You're stealing I/O from something 
else any time you force a write out.  The optimal throughput path for 
checkpoints turns out to be delaying every single bit of I/O as long as 
possible, in favor of the [backend|bgwriter] writes and WAL.  Whenever 
you delay a buffer write, you have increased the possibility that 
someone else will write the same block again.  And the buffers being 
written by the checkpointer are, on average, the most popular ones in 
the database.  Writing any of them to disk pre-emptively has high odds 
of writing the same block more than once per checkpoint.  And that 
easy-to-measure waste--it shows up as more writes/transaction in 
pg_stat_bgwriter--hurts throughput more than any reduction in seek 
overhead you might otherwise get from early writes.  The big gain isn't 
chasing after cheap seeks.  The best path is the one that decreases the 
total volume of writes.


We played this game with the background writer work for 8.3.  The main 
reason the one committed improved on the original design is that it 
completely eliminated doing work on popular buffers in advance. 
Everything happens at the last possible time, which is the optimal 
throughput situation.  The 8.1/8.2 BGW used to try and write things out 
before they were strictly necessary, in hopes that that I/O would be 
free.  But it rarely was, while there was always a cost to forcing them 
to disk early.  And that cost is highest when you're talking about the 
higher usage blocks the checkpointer tends to write.  When in doubt, 
always delay the write in hopes it will be written to again and you'll 
save work.



So it occurs to me that perhaps we can watch for patterns where we have
groups of adjacent writes that might stream, and when they form we might
schedule them...


Stop here.  I mentioned something upthread that is worth repeating.

The checkpointer doesn't know what concurrent reads are happening.  We 
can't even easily make it know, not without adding a whole new source of 
IPC and locking contention among clients.


Whatever scheduling decision the checkpointer might make with its 
limited knowledge of system I/O is going to be poor.  You might find a 
100% write benchmark that it helps, but those are not representative of 
the real world.  In any mixed read/write case, the operating system is 
likely to do better.  That's why things like sorting blocks sometimes 
seem to help someone, somewhere, with one workload, but then aren't 
repeatable.


We can decide to trade throughput for latency by nudging the OS to deal 
with its queued writes more regularly.  That will result in more total 
writes, which is the reason throughput drops.


But the idea that PostgreSQL is going to do a better global job of I/O 
scheduling, that road is a hard one to walk.  It's only going to happen 
if we pull all of the I/O into the database *and* do a better job on the 
entire process than the existing OS kernel does.  That sort of dream, of 
outperforming the filesystem, it is very difficult to realize.  There's 
a good reason that companies like Oracle stopped pushing so hard on 
recommending raw partitions.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




[HACKERS] ECPG timestamp '%j'

2013-07-14 Thread Stephen Frost
Michael,

  While looking at complaints from the Coverity scanner system, it looks
  like it's detected a case in ECPG where we provide a day-of-year
  format option (%j), but we never actually calculate what the day of
  year *is*, resulting in an uninitialized value.

  Other parts of the code (non-ECPG) appear to realize that timestamp2tm
  doesn't fill in day-of-year in the struct and they calculate it
  afterwards.  Perhaps ECPG needs to adopt that approach as well, either
  in dttofmtasc_replace() or PGTYPEStimestamp_fmt_asc()?
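
  If it helps, here is a standalone sketch of the missing piece -- filling
  in day-of-year after the fact from fields that are already available;
  plain C for illustration, not the backend's actual helpers:

#include <stdio.h>

static int
day_of_year(int year, int month, int day)
{
    static const int days_before_month[12] =
        {0, 31, 59, 90, 120, 151, 181, 212, 243, 273, 304, 334};
    int yday = days_before_month[month - 1] + day;
    int leap = (year % 4 == 0 && year % 100 != 0) || year % 400 == 0;

    if (leap && month > 2)
        yday++;
    return yday;
}

int
main(void)
{
    printf("%d\n", day_of_year(2013, 7, 14));   /* prints 195 */
    return 0;
}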

  I was able to get what I believe is an incorrect result through a bit
  of hacking on the ECPG test cases:

  timestamp_fmt_asc: 0: 10922-abc%

  after adding:

out = (char*) malloc(32);
i = PGTYPEStimestamp_fmt_asc(&ts1, out, 31, "%j-abc%%");
printf("timestamp_fmt_asc: %d: %s\n", i, out);
free(out);

  into pgtypeslib/dt_test.pgc.

  If you don't have time to look into this, let me know and I'll try and
  get back to it soon.

Thanks!

Stephen




Re: [HACKERS] Changing recovery.conf parameters into GUCs

2013-07-14 Thread Greg Smith

On 7/10/13 9:39 AM, Heikki Linnakangas wrote:

On 10.07.2013 02:54, Josh Berkus wrote:

One bit of complexity I'd like to see go away is that we have two
trigger files: one to put a server into replication, and one to promote
it.  The promotion trigger file is a legacy of warm standby, I believe.
  Maybe, now that we have pg_ctl promote available, we can eliminate the
promotion trigger?


No, see http://www.postgresql.org/message-id/5112a54b.8090...@vmware.com.


Right, you had some good points in there.  STONITH is so hard already, 
we need to be careful about eliminating options there.


All the summaries added here have actually managed to revive this one 
usefully early in the release cycle!  Well done.  I just tried to apply 
Michael's 20130325_recovery_guc_v3.patch and the bit rot isn't that bad 
either.  5 rejection files, nothing big in them.


The only overlap between the recovery.conf GUC work and refactoring the 
trigger file is that the trigger file is referenced in there, and we 
really need to point it somewhere to be most useful.



Personally, I don't care if we support the old recovery.conf-trigger
behavior as long as I'm not forced to use it.


If you accept Heikki's argument for why the file can't go away 
altogether, and it's pretty reasonable, I think two changes reach a 
point where everyone can live with this:


-We need a useful default filename for trigger_file to point at.
-pg_ctl failover creates that file.

As for the location to default to, the pg_standby docs suggest 
/tmp/pgsql.trigger.5432 while the Binary Replication Tutorial on the 
wiki uses a RedHat directory layout, $PGDATA/failover.


The main reason I've preferred something in the data directory is that 
triggering a standby is too catastrophic for me to be comfortable 
putting it in /tmp.  Any random hooligan with a shell account can 
trigger a standby and break its replication.  Putting the unix socket 
into /tmp only works because the server creates the file as part of 
startup.  Here that's not possible, because creating the trigger is the 
signalling mechanism.


I don't think there needs to be a CLI interface for putting the 
alternate possible text into the trigger file--the part where you can ask 
for 'fast' startup.  It's nice to have available as an expert option, but 
it's fine for that to be harder to do.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




[HACKERS] pg_memory_barrier() doesn't compile, let alone work, for me

2013-07-14 Thread Tom Lane
Commit 9a20a9b2 breaks the build for me, using gcc on HPPA:

xlog.c:2182: macro `pg_memory_barrier' used without args
xlog.c:2546: macro `pg_memory_barrier' used without args
make[4]: *** [xlog.o] Error 1

The reason for this is that the fallback definition of
pg_memory_barrier has been wrong since day one; it needs this fix:

-#define pg_memory_barrier(x) \
+#define pg_memory_barrier() \

However, fixing that doesn't yield much joy; initdb stalls and then
crashes with

PANIC:  stuck spinlock (40054a88) detected at xlog.c:2182

The reason for that is that the code does not bother to initialize
dummy_spinlock anywhere.  It might accidentally fail to fail
on machines where the unlocked state of a spinlock is all-zeroes
(given a compiler that's not picky about the incorrect macro usage)
... but HPPA is not such a machine.

Rather than trying to think of a fix for that, I'm of the opinion that
we should rip this out.  The fallback implementation of pg_memory_barrier
ought to be pg_compiler_barrier(), on the theory that non-mainstream
architectures don't have weak memory ordering anyway, or if they do you
need to do some work to get PG to work on them.  Or maybe we ought to
stop pretending that the code is likely to work on arbitrary machines,
and just #error if there's not a supplied machine-specific macro.

regards, tom lane




Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-14 Thread Jeff Janes
On Sunday, July 14, 2013, Greg Smith wrote:

 On 6/27/13 11:08 AM, Robert Haas wrote:

 I'm pretty sure Greg Smith tried it the fixed-sleep thing before and
 it didn't work that well.


 That's correct, I spent about a year whipping that particular horse and
 submitted improvements on it to the community.
 http://www.postgresql.org/message-id/4d4f9a3d.5070...@2ndquadrant.com
 and its updates downthread are good ones to compare this current work
 against.

 The important thing to realize about just delaying fsync calls is that it
 *cannot* increase TPS throughput.  Not possible in theory, obviously
 doesn't happen in practice.  The most efficient way to write things out is
 to delay those writes as long as possible.  The longer you postpone a
 write, the more elevator sorting and write combining you get out of the OS.
  This is why operating systems like Linux come tuned for such delayed
 writes in the first place.  Throughput and latency are linked; any patch
 that aims to decrease latency will probably slow throughput.


Do common low level IO benchmarking tools cover this territory?  I've
looked at Bonnie, which seems to be the most famous one, and it doesn't
look like it covers effectiveness of write combining at all.

I've done my own casual benchmarking, and the results were astonishingly
bad for the OS/FS.  If I over-wrote 1024*1024 blocks of 8KB in random order
and then fsynced the 8GB of data (divided into 8x1GB files, in deference to
PG segment size) it took way longer than if I did the overwrite in block
order and then fsynced that.   This was a gift-horse machine not speced out
to be a database server, but the linux kernel is still the kernel
regardless of the hardware it sits on, so I don't know how much that should
matter.  To be clear, the writes did not take longer, it was the fsyncs
that took longer. All writes were successfully absorbed into memory
promptly.  Alas, I no longer have access to a machine which can absorb 8GB
of writes into RAM without thinking twice and which I can use for casual
experimentation.
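
In case anyone wants to repeat that casual test, here is a minimal sketch of
it under simplifying assumptions (a single 1GB file instead of eight, coarse
timing, and no pre-fill of the file, which a faithful overwrite test would
add):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define BLOCK_SZ 8192
#define NBLOCKS  131072                 /* 131072 * 8KB = 1GB */

int main(int argc, char **argv)
{
    int     random_order = (argc > 1 && strcmp(argv[1], "random") == 0);
    long   *order = malloc(NBLOCKS * sizeof(long));
    char    buf[BLOCK_SZ];
    int     fd = open("overwrite_test.dat", O_CREAT | O_RDWR, 0644);
    time_t  start;

    if (fd < 0 || order == NULL)
    {
        perror("setup");
        return 1;
    }
    memset(buf, 'x', sizeof(buf));

    for (long i = 0; i < NBLOCKS; i++)
        order[i] = i;
    if (random_order)                   /* Fisher-Yates shuffle */
        for (long i = NBLOCKS - 1; i > 0; i--)
        {
            long j = random() % (i + 1);
            long tmp = order[i];
            order[i] = order[j];
            order[j] = tmp;
        }

    /* the writes themselves are absorbed by the OS cache either way ... */
    for (long i = 0; i < NBLOCKS; i++)
        if (pwrite(fd, buf, BLOCK_SZ, order[i] * (off_t) BLOCK_SZ) != BLOCK_SZ)
        {
            perror("pwrite");
            return 1;
        }

    /* ... the interesting number is how long this takes */
    start = time(NULL);
    if (fsync(fd) != 0)
    {
        perror("fsync");
        return 1;
    }
    printf("fsync after %s writes: %ld seconds\n",
           random_order ? "random-order" : "in-order",
           (long) (time(NULL) - start));
    close(fd);
    return 0;
}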

Cheers,

Jeff


Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-14 Thread Jeff Janes
On Sunday, July 14, 2013, Greg Smith wrote:

 On 7/14/13 5:28 PM, james wrote:

 Some random seeks during sync can't be helped, but if they are done when
 we aren't waiting for sync completion then they are in effect free.


 That happens sometimes, but if you measure you'll find this doesn't
 actually occur usefully in the situation everyone dislikes.  In a write
 heavy environment where the database doesn't fit in RAM, backends and/or
 the background writer are constantly writing data out to the OS.  WAL is
 going out constantly as well, and in many cases that's competing for the
 disks too.


While I think it is probably true that many systems don't separate WAL from
non-WAL to different IO controllers, is it true that many systems that are
in need of heavy IO tuning don't do so?  I thought that that would be the
first stop for any DBA of a highly IO-write-constrained database.



  The most popular blocks in the database get high usage counts and they
 never leave shared_buffers except at checkpoint time. That's easy to prove
 to yourself with pg_buffercache.

 And once the write cache fills, every I/O operation is now competing.
 There is nothing happening for free.  You're stealing I/O from something
 else any time you force a write out.  The optimal throughput path for
 checkpoints turns out to be delaying every single bit of I/O as long as
 possible, in favor of the [backend|bgwriter] writes and WAL.  Whenever you
 delay a buffer write, you have increased the possibility that someone else
 will write the same block again.   And the buffers being written by the
 checkpointer are, on average, the most popular ones in the database.
  Writing any of them to disk pre-emptively has high odds of writing the
 same block more than once per checkpoint.



Should the checkpointer make multiple passes over the buffer pool, writing
out the high usage_count buffers first, because no one else is going to do
it, and then going back for the low usage_count buffers in the hope they
were already written out?  On the other hand, if the checkpointer writes
out a low-usage buffer, why would anyone else need to write it again soon?
 If it were likely to get dirtied often, it wouldn't be low usage.  If it
was dirtied rarely, it wouldn't be dirty anymore once written.

Cheers,

Jeff


Re: [HACKERS] pg_memory_barrier() doesn't compile, let alone work, for me

2013-07-14 Thread Robert Haas
On Sun, Jul 14, 2013 at 8:41 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Commit 9a20a9b2 breaks the build for me, using gcc on HPPA:

 xlog.c:2182: macro `pg_memory_barrier' used without args
 xlog.c:2546: macro `pg_memory_barrier' used without args
 make[4]: *** [xlog.o] Error 1

 The reason for this is that the fallback definition of
 pg_memory_barrier has been wrong since day one; it needs this fix:

 -#define pg_memory_barrier(x) \
 +#define pg_memory_barrier() \

Uggh, sorry.

 However, fixing that doesn't yield much joy; initdb stalls and then
 crashes with

 PANIC:  stuck spinlock (40054a88) detected at xlog.c:2182

 The reason for that is that the code does not bother to initialize
 dummy_spinlock anywhere.  It might accidentally fail to fail
 on machines where the unlocked state of a spinlock is all-zeroes
 (given a compiler that's not picky about the incorrect macro usage)
 ... but HPPA is not such a machine.

This would not be hard to fix, I think.

 Rather than trying to think of a fix for that, I'm of the opinion that
 we should rip this out.  The fallback implementation of pg_memory_barrier
 ought to be pg_compiler_barrier(), on the theory that non-mainstream
 architectures don't have weak memory ordering anyway, or if they do you
 need to do some work to get PG to work on them.  Or maybe we ought to
 stop pretending that the code is likely to work on arbitrary machines,
 and just #error if there's not a supplied machine-specific macro.

Well, pg_memory_barrier() isn't even equivalent to
pg_compiler_barrier() on x86, which has among the strongest memory
orderings out there, so I think your first idea is a non-starter.  A
compiler barrier on x86 will keep reads ordered against reads and writes
against writes, but it will not prevent a later read from being speculated
ahead of an earlier write, which a full barrier must do.
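
To make that concrete, here is a small stand-alone illustration
(gcc-style inline asm on x86; these macros are illustrative stand-ins,
not the definitions from barrier.h):

/* Illustrative macros, not PostgreSQL's. */
#define compiler_barrier() __asm__ __volatile__("" ::: "memory")
#define memory_barrier()   __asm__ __volatile__("mfence" ::: "memory")

volatile int x = 0, y = 0;
int r1, r2;

/* Run concurrently on two CPUs. */
void thread_a(void) { x = 1; compiler_barrier(); r1 = y; }
void thread_b(void) { y = 1; compiler_barrier(); r2 = x; }

/*
 * The compiler emits each store before its following load, yet the outcome
 * r1 == 0 && r2 == 0 is still allowed: each CPU may satisfy its load before
 * its store has drained from the store buffer.  Replace both
 * compiler_barrier() calls with memory_barrier() and that outcome is gone.
 */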

I'm pretty sure we've got latent memory-ordering risks in our existing
code which we just haven't detected and fixed yet.  Consider, for
example, this exciting code from GetNewTransactionId:

myproc->subxids.xids[nxids] = xid;
mypgxact->nxids = nxids + 1;

I don't believe that's technically safe even on an architecture like
x86, because the compiler could decide to reorder those assignments.
Of course there is probably no reason to do so, and even if it does
you'd have to get really unlucky to see a user-visible failure, and if
you did you'd probably misguess the cause.

However, I'm guessing that as we start using memory barriers more, and
as we speed up the system generally, we're going to see bugs like this
that are currently hidden start to become visible from time to time.
Especially in view of that, I think it's wise to be pessimistic rather
than optimistic about what barriers are needed, when we don't know for
certain.  That way, when we get a memory-ordering-related bug report,
we can at least hope that the problem must be something specific to
the problem area rather than a general failure to use the right memory
barrier primitives.

My preference would be to fix this in a narrow way, by initializing
dummy_spinlock somewhere.  But if not, then I think #error is the only
safe way to go.
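
Concretely, the narrow fix amounts to pairing the spinlock-based fallback
with an explicit one-time initialization.  The sketch below is a
paraphrase, not a quote of barrier.h, it assumes the usual
storage/s_lock.h and storage/spin.h definitions, and the right place for
the SpinLockInit() call is exactly the open question:

/* fallback when no machine-specific barrier implementation exists */
extern slock_t dummy_spinlock;

#define pg_memory_barrier() \
	do { S_LOCK(&dummy_spinlock); S_UNLOCK(&dummy_spinlock); } while (0)

/* ... and, once per process, before the first barrier is executed: */
SpinLockInit(&dummy_spinlock);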

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] review: Non-recursive processing of AND/OR lists

2013-07-14 Thread Robert Haas
On Wed, Jul 10, 2013 at 9:02 PM, Josh Berkus j...@agliodbs.com wrote:
 I think it's a waste of code to try to handle bushy trees.  A list is
 not a particularly efficient representation of the pending list; this
 will probably be slower than recursing in the common case.  I'd suggest
 keeping the logic to handle left-deep trees, which I find rather
 elegant, but ditching the pending list.

 Is there going to be further discussion of this patch, or do I return it?

Considering it's not been updated, nor my comments responded to, in
almost two weeks, I think we return it at this point.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: FILTER for aggregates [was Re: Department of Redundancy Department: makeNode(FuncCall) division]

2013-07-14 Thread Noah Misch
On Sun, Jul 07, 2013 at 06:37:26PM -0700, David Fetter wrote:
 On Sat, Jul 06, 2013 at 11:49:21AM +0100, Dean Rasheed wrote:
  Overall I think this patch offers useful additional functionality, in
  compliance with the SQL spec, which should be handy to simplify
  complex grouping queries.

As I understand this feature, it is syntactic sugar for the typical case of an
aggregate with a strict transition function.  For example, min(x) FILTER
(WHERE y > 0) is rigorously equivalent to min(CASE WHEN y > 0 THEN x END).
Every SQL aggregate is strict (ISO 9075-2 section 4.15.4), so for standard
queries it is *only* syntactic sugar.  In PostgreSQL, it lets you do novel
things with, for example, array_agg().  Is that accurate?

  I think this is ready for committer.

The patch was thorough.  I updated applicable comments, revisited some
cosmetic choices, and made these functional changes:

1. The pg_stat_statements query jumble should incorporate the filter.

2. The patch did not update costing.  I made it add the cost of the filter
expression the same way we add the cost of the argument expressions.  This
makes min(x) FILTER (WHERE y > 0) match min(CASE WHEN y > 0 THEN x END) in
terms of cost, which is a fair start.  At some point, we could do better by
reducing the argument cost by the filter selectivity.


A few choices/standard interpretations may deserve discussion.  The patch
makes filter clauses contribute to the subquery level chosen to be the
aggregation query.  This is observable through the behavior of these two
standard-conforming queries:

select (select count(outer_c)
from (values (1)) t0(inner_c))
from (values (2),(3)) t1(outer_c); -- outer query is aggregation query
select (select count(outer_c) filter (where inner_c > 0)
from (values (1)) t0(inner_c))
from (values (2),(3)) t1(outer_c); -- inner query is aggregation query

I believe SQL (ISO 9075-2 section 6.9 SR 3,4,6) does require this.  Since that
still isn't crystal-clear to me, I mention it in case anyone has a different
reading.

Distinct from that, albeit in a similar vein, SQL does not permit outer
references in a filter clause.  This patch permits them; I think that
qualifies as a reasonable PostgreSQL extension.

 --- a/doc/src/sgml/keywords.sgml
 +++ b/doc/src/sgml/keywords.sgml

 @@ -3200,7 +3200,7 @@
 </row>
 <row>
  <entry><token>OVER</token></entry>
 -<entry>reserved (can be function or type)</entry>
 +<entry>non-reserved</entry>

I committed this one-line correction separately.

 --- a/src/backend/optimizer/plan/planagg.c
 +++ b/src/backend/optimizer/plan/planagg.c
 @@ -314,7 +314,7 @@ find_minmax_aggs_walker(Node *node, List **context)
   ListCell   *l;
  
  	Assert(aggref->agglevelsup == 0);
 -	if (list_length(aggref->args) != 1 || aggref->aggorder != NIL)
 +	if (list_length(aggref->args) != 1 || aggref->aggorder != NIL
 		|| aggref->agg_filter != NULL)
   return true;/* it couldn't be MIN/MAX */
   /* note: we do not care if DISTINCT is mentioned ... */

I twitched upon reading this, because neither ORDER BY nor FILTER preclude the
aggregate being MIN or MAX.  Perhaps Andrew can explain why he put aggorder
there back in 2009.  All I can figure is that writing max(c ORDER BY x) is so
unlikely that we'd too often waste the next syscache lookup.  But the same
argument would apply to DISTINCT.  With FILTER, the rationale is entirely
different.  The aggregate could well be MIN/MAX; we just haven't implemented
the necessary support elsewhere in this file.


See attached patch revisions.  The first patch edits find_minmax_aggs_walker()
per my comments just now.  The second is an update of your FILTER patch with
the changes to which I alluded above; it applies atop the first patch.  Would
you verify that I didn't ruin anything?  Barring problems, will commit.

Are you the sole named author of this patch?  That's what the CF page says,
but that wasn't fully clear to me from the list discussions.

Thanks,
nm

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com
*** a/src/backend/optimizer/plan/planagg.c
--- b/src/backend/optimizer/plan/planagg.c
***
*** 314,328  find_minmax_aggs_walker(Node *node, List **context)
ListCell   *l;
  
  	Assert(aggref->agglevelsup == 0);
!	if (list_length(aggref->args) != 1 || aggref->aggorder != NIL)
  		return true;		/* it couldn't be MIN/MAX */
!	/* note: we do not care if DISTINCT is mentioned ... */
!	curTarget = (TargetEntry *) linitial(aggref->args);
  
  	aggsortop = fetch_agg_sort_op(aggref->aggfnoid);
  	if (!OidIsValid(aggsortop))
  		return true;		/* not a MIN/MAX aggregate */
  
  	if (contain_mutable_functions((Node *) curTarget->expr))
return true;/* not potentially 

Re: [HACKERS] Eliminating PD_ALL_VISIBLE, take 2

2013-07-14 Thread Robert Haas
On Fri, Jul 5, 2013 at 4:18 PM, Jeff Davis pg...@j-davis.com wrote:
 On Tue, 2013-07-02 at 10:12 -0700, Jeff Davis wrote:
 Regardless, this is at least a concrete issue that I can focus on, and I
 appreciate that. Are scans of small tables the primary objection to this
 patch, or are there others? If I solve it, will this patch make real
 progress?

 I had an idea here:

 What if we keep PD_ALL_VISIBLE, but make it more like other hints, and
 make it optional? If a page is all visible, either or both of the VM bit
 or PD_ALL_VISIBLE could be set (please suspend disbelief for a moment).

 Then, we could use a heuristic, like setting PD_ALL_VISIBLE in the first
 256 pages of a relation; but in later pages, only setting it if the page
 is already dirty for some other reason.

 That has the following benefits:

 1. Eliminates the worry over contention related to scans, because we
 wouldn't need to use the VM for small tables. And I don't think anyone
 was worried about using the VM on a large table scan.

 2. Still avoids dirtying lots of pages after a data load. I'm not
 worried if a few MB of data is rewritten on a large table.

 3. Eliminates the complex way in which we (ab)use WAL and the recovery
 mechanism to keep PD_ALL_VISIBLE and the VM bit in sync.

 Of course, there's a reason that PD_ALL_VISIBLE is not like a normal
 hint: we need to make sure that inserts/updates/deletes clear the VM
 bit. But my patch already addresses that by keeping the VM page pinned.

I'm of the opinion that we ought to extract the parts of the patch
that hold the VM pin for longer, review those separately, and if
they're good and desirable, apply them.  Although that optimization
becomes more necessary if we were to adopt your proposal than it is
now, it's really separate from this patch.  Given that VM pin caching
can be done with or without removing PD_ALL_VISIBLE, it seems to me
that the fair comparison is between master + VM pin caching and master
+ VM pin caching + remove PD_ALL_VISIBLE.  Comparing the latter vs.
unpatched master seems to me to be confusing the issue.

 That has some weaknesses: as Heikki pointed out[1], you can defeat the
 cache of the pin by randomly seeking between 512MB regions during an
 update (would only be noticable if it's all in shared buffers already,
 of course). But even in that case, it was a fairly modest degradation
 (20%), so overall this seems like a fairly minor drawback.

I am not convinced.  I thought about the problem of repeatedly
switching pinned VM pages during the index-only scans work, and
decided that we could live with it because, if the table was large
enough that we were pinning VM pages frequently, we were also avoiding
I/O.  Of course, this is a logical fallacy, since the table could
easily be large enough to have quite a few VM pages and yet small
enough to fit in RAM.  And, indeed, at least in the early days, an
index scan could beat out an index-only scan by a significant margin
on a memory-resident table, precisely because of the added cost of the
VM lookups.  I haven't benchmarked lately so I don't know for sure
whether that's still the case, but I bet it is.

From your other email:
 I have a gut feeling that the complexity we go through to maintain
 PD_ALL_VISIBLE is unnecessary and will cause us problems later. If we
 could combine freezing and marking all-visible, and use WAL for
 PD_ALL_VISIBLE in a normal fashion, then I'd be content with that.

I think this idea is worth exploring, although I fear the overhead is
likely to be rather large.  We could find out, though.  Suppose we
simply change XLOG_HEAP2_VISIBLE to emit FPIs for the heap pages; how
much does that slow down vacuuming a large table into which many pages
have been bulk loaded?  Sadly, I bet it's rather a lot, but I'd like
to be wrong.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Add regression tests for COLLATE

2013-07-14 Thread Robert Haas
On Sat, Jul 6, 2013 at 9:27 AM, Andrew Dunstan and...@dunslane.net wrote:
 Or maybe just invent a magic result file name such as none or
 do_not_run.

 I'm not keen to have us build up a large patchwork of regression tests that
 run on some platforms and not on others, though.

Me, neither.  But I don't really see a plausible alternative in this
particular case, since we're trying to test a feature that only works
in the presence of specific platform support.  I don't think that
having no buildfarm testing of collation support is better.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: FILTER for aggregates [was Re: Department of Redundancy Department: makeNode(FuncCall) division]

2013-07-14 Thread David Fetter
On Sun, Jul 14, 2013 at 10:15:12PM -0400, Noah Misch wrote:
 On Sun, Jul 07, 2013 at 06:37:26PM -0700, David Fetter wrote:
  On Sat, Jul 06, 2013 at 11:49:21AM +0100, Dean Rasheed wrote:
   Overall I think this patch offers useful additional functionality, in
   compliance with the SQL spec, which should be handy to simplify
   complex grouping queries.
 
 As I understand this feature, it is syntactic sugar for the typical case of an
 aggregate with a strict transition function.  For example, min(x) FILTER
 (WHERE y > 0) is rigorously equivalent to min(CASE WHEN y > 0 THEN x END).
 Every SQL aggregate is strict (ISO 9075-2 section 4.15.4), so for standard
 queries it is *only* syntactic sugar.  In PostgreSQL, it lets you do novel
 things with, for example, array_agg().  Is that accurate?
 
   I think this is ready for committer.
 
 The patch was thorough.  I updated applicable comments, revisited some
 cosmetic choices, and made these functional changes:
 
 1. The pg_stat_statements query jumble should incorporate the filter.
 
 2. The patch did not update costing.  I made it add the cost of the filter
 expression the same way we add the cost of the argument expressions.  This
 makes min(x) FILTER (WHERE y > 0) match min(CASE WHEN y > 0 THEN x END) in
 terms of cost, which is a fair start.  At some point, we could do better by
 reducing the argument cost by the filter selectivity.

Thanks!

 A few choices/standard interpretations may deserve discussion.  The patch
 makes filter clauses contribute to the subquery level chosen to be the
 aggregation query.  This is observable through the behavior of these two
 standard-conforming queries:
 
 select (select count(outer_c)
 from (values (1)) t0(inner_c))
 from (values (2),(3)) t1(outer_c); -- outer query is aggregation query
 select (select count(outer_c) filter (where inner_c > 0)
 from (values (1)) t0(inner_c))
 from (values (2),(3)) t1(outer_c); -- inner query is aggregation query
 
 I believe SQL (ISO 9075-2 section 6.9 SR 3,4,6) does require this.  Since that
 still isn't crystal-clear to me, I mention it in case anyone has a different
 reading.
 
 Distinct from that, albeit in a similar vein, SQL does not permit outer
 references in a filter clause.  This patch permits them; I think that
 qualifies as a reasonable PostgreSQL extension.

Thanks again.

  --- a/doc/src/sgml/keywords.sgml
  +++ b/doc/src/sgml/keywords.sgml
 
  @@ -3200,7 +3200,7 @@
  </row>
  <row>
   <entry><token>OVER</token></entry>
  -<entry>reserved (can be function or type)</entry>
  +<entry>non-reserved</entry>
 
 I committed this one-line correction separately.
 
  --- a/src/backend/optimizer/plan/planagg.c
  +++ b/src/backend/optimizer/plan/planagg.c
  @@ -314,7 +314,7 @@ find_minmax_aggs_walker(Node *node, List **context)
  ListCell   *l;
   
  	Assert(aggref->agglevelsup == 0);
  -	if (list_length(aggref->args) != 1 || aggref->aggorder != NIL)
  +	if (list_length(aggref->args) != 1 || aggref->aggorder != NIL
  		|| aggref->agg_filter != NULL)
  return true;/* it couldn't be MIN/MAX */
  /* note: we do not care if DISTINCT is mentioned ... */
 
 I twitched upon reading this, because neither ORDER BY nor FILTER preclude the
 aggregate being MIN or MAX.  Perhaps Andrew can explain why he put aggorder
 there back in 2009.  All I can figure is that writing max(c ORDER BY x) is so
 unlikely that we'd too often waste the next syscache lookup.  But the same
 argument would apply to DISTINCT.  With FILTER, the rationale is entirely
 different.  The aggregate could well be MIN/MAX; we just haven't implemented
 the necessary support elsewhere in this file.

Excellent reasoning, and good catch.

 See attached patch revisions.  The first patch edits
 find_minmax_aggs_walker() per my comments just now.  The second is
 an update of your FILTER patch with the changes to which I alluded
 above; it applies atop the first patch.  Would you verify that I
 didn't ruin anything?  Barring problems, will commit.
 
 Are you the sole named author of this patch?  That's what the CF
 page says, but that wasn't fully clear to me from the list
 discussions.

While Andrew's help was invaluable and pervasive, I wrote the patch,
and all flaws in it are my fault.

Cheers,
David.
-- 
David Fetter da...@fetter.org http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal - Support for National Characters functionality

2013-07-14 Thread Robert Haas
On Fri, Jul 5, 2013 at 2:35 PM, Pavel Stehule pavel.steh...@gmail.com wrote:
 Yes, as far as I know almost everyone uses UTF-8 without problems. I haven't
 seen any request for multi-encoding support in a long time.

Well, not *everything* can be represented as UTF-8; I think this is
particularly an issue with Asian languages.

If we chose to do it, I think that per-column encoding support would
end up looking a lot like per-column collation support: it would be
yet another per-column property along with typoid, typmod, and
typcollation.  I'm not entirely sure it's worth it, although FWIW I do
believe Oracle has something like this.  At any rate, it seems like
quite a lot of work.

Another idea would be to do something like what we do for range types
- i.e. allow a user to declare a type that is a differently-encoded
version of some base type.  But even that seems pretty hard.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: template-ify (binary) extensions

2013-07-14 Thread Robert Haas
On Fri, Jul 5, 2013 at 4:27 PM, Markus Wanner mar...@bluegap.ch wrote:
 One way to resolve that - and that seems to be the direction Dimitri's
 patch is going - is to make the extension depend on its template. (And
 thereby breaking the mental model of a template, IMO. In the spirit of
 that mental model, one could argue that code for stored procedures
 shouldn't be duplicated, but instead just reference its ancestor.)

The security problems with this model have been discussed in every
email thread about the extension template feature.  There is a lot of
(well-founded, IMHO) resistance to the idea of letting users install C
code via an SQL connection.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] review: Non-recursive processing of AND/OR lists

2013-07-14 Thread Gurjeet Singh
On Sun, Jul 14, 2013 at 8:27 PM, Robert Haas robertmh...@gmail.com wrote:

 On Wed, Jul 10, 2013 at 9:02 PM, Josh Berkus j...@agliodbs.com wrote:
  I think it's a waste of code to try to handle bushy trees.  A list is
  not a particularly efficient representation of the pending list; this
  will probably be slower than recursing in the common case.  I'd suggest
  keeping the logic to handle left-deep trees, which I find rather
  elegant, but ditching the pending list.


Somehow I find it hard to believe that recursing would be more efficient
than processing the items right there. The recursion is not direct either;
transformExprRecurse() is going to call this function again, but after a
few more switch-case comparisons.

Agreed that there's overhead in allocating list items, but is it more
overhead than pushing functions on the call stack? Not sure, so I leave it
to others who understand such things better than I do.

If by common-case you mean a list of just one logical AND/OR operator, then
I agree that creating and destroying a list may incur overhead that is
relatively expensive. To that end, I have altered the patch, attached,
to not build a pending list until we encounter a node with root_expr_kind
in a right branch.

We're getting bushy-tree processing with very little extra code, but if you
deem it not worthwhile or adding complexity, please feel free to rip it out.
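
To spell out the left-deep part in isolation, here is a toy sketch with
invented node and list types (it is not the parser's BoolExpr machinery):
the left spine is walked iteratively, and only an AND/OR appearing in a
right branch would need the extra pending-list bookkeeping discussed above.

#include <stdlib.h>

/* Toy left-deep AND chain: (((a AND b) AND c) AND d).  Invented types. */
typedef enum { NODE_LEAF, NODE_AND } NodeKind;

typedef struct Node
{
	NodeKind	kind;
	struct Node *left;			/* set when kind == NODE_AND */
	struct Node *right;
	const char *name;			/* set when kind == NODE_LEAF */
} Node;

typedef struct Cell
{
	struct Cell *next;
	Node	   *item;
} Cell;

/* Collect the operands of a left-deep AND chain without recursion. */
static Cell *
flatten_left_deep(Node *node)
{
	Cell	   *result = NULL;

	while (node->kind == NODE_AND)
	{
		/*
		 * In a purely left-deep tree the right child is a leaf; an AND/OR
		 * in a right branch is the only case that needs extra machinery
		 * (recursion, or a pending list).
		 */
		Cell	   *cell = malloc(sizeof(Cell));

		cell->item = node->right;
		cell->next = result;
		result = cell;
		node = node->left;		/* iterate down the left spine */
	}

	/* node is now the leftmost operand */
	Cell	   *cell = malloc(sizeof(Cell));

	cell->item = node;
	cell->next = result;
	result = cell;
	return result;
}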


  
  Is there going to be further discussion of this patch, or do I return it?

 Considering it's not been updated, nor my comments responded to, in
 almost two weeks, I think we return it at this point.


Sorry, I didn't notice that this patch was put back in  'Waiting on Author'
state.

Best regards,
-- 
Gurjeet Singh

http://gurjeet.singh.im/

EnterpriseDB Inc.


non_recursive_and_or_transformation_v6.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal - Support for National Characters functionality

2013-07-14 Thread Arulappan, Arul Shaji
 
 On Fri, Jul 5, 2013 at 2:35 PM, Pavel Stehule pavel.steh...@gmail.com wrote:
  Yes, as far as I know almost everyone uses UTF-8 without problems. I
  haven't seen any request for multi-encoding support in a long time.
 
 Well, not *everything* can be represented as UTF-8; I think this is
 particularly an issue with Asian languages.
 
 If we chose to do it, I think that per-column encoding support would
 end up looking a lot like per-column collation support: it would be
 yet another per-column property along with typoid, typmod, and
 typcollation.  I'm not entirely sure it's worth it, although FWIW I do
 believe Oracle has something like this.

Yes, the idea is that users will be able to declare columns of type
NCHAR or NVARCHAR which will use the pre-determined encoding type. If we
say that NCHAR is UTF-8 then the NCHAR column will be of UTF-8 encoding
irrespective of the database encoding. It will be up to us to restrict
what Unicode encodings we want to support for NCHAR/NVARCHAR columns.
This is based on my interpretation of the SQL standard. As you allude to
above, Oracle has a similar behaviour (they support UTF-16 as well).  

Support for UTF-16 will be difficult without linking with some external
libraries such as ICU. 


 At any rate, it seems like quite a lot of work.

Thanks for putting my mind at ease ;-)

Rgds,
Arul Shaji


 
  Another idea would be to do something like what we do for range types
  - i.e. allow a user to declare a type that is a differently-encoded
  version of some base type.  But even that seems pretty hard.
 
 --
 Robert Haas
  EnterpriseDB: http://www.enterprisedb.com
  The Enterprise PostgreSQL Company




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers