Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-20 Thread didier
Hi

With your tests did you try to write the hot buffers first? ie buffers with
a high  refcount, either by sorting them on refcount or at least sweeping
the buffer list in reverse?

In my understanding there's an 'impedance mismatch' between what PostgreSQL
wants and what the OS offers.
When it calls fsync(), PostgreSQL wants a set of buffers, selected quickly at
checkpoint start time, written to disk; but the OS only offers to write
all dirty buffers at fsync time. Not exactly the same contract: on a
loaded server with checkpoint spreading the difference can be big, worst
case the checkpoint wants 8KB written and fsync writes 1GB.

As a control problem, there's 150 years of math behind it, going back to
Maxwell himself. Adding as little energy as possible, as randomly as
possible, to a control system whose actuators you can't even measure does
not make for good regulation.

By writing to the OS the buffers least likely to be recycled first, it may
have less work to do at fsync time: hopefully they have been written by the
OS background task during the spread and are not re-dirtied by other
backends.

Didier


Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-21 Thread didier
On Sat, Jul 20, 2013 at 6:28 PM, Greg Smith g...@2ndquadrant.com wrote:

 On 7/20/13 4:48 AM, didier wrote:

 With your tests did you try to write the hot buffers first? ie buffers
 with a high  refcount, either by sorting them on refcount or at least
 sweeping the buffer list in reverse?


 I never tried that version.  After a few rounds of seeing that all changes
 I tried were just rearranging the good and bad cases, I got pretty bored
 with trying new changes in that same style.


  by writing to the OS the less likely to be recycle buffers first it may
 have less work to do at fsync time, hopefully they have been written by
 the OS background task during the spread and are not re-dirtied by other
 backends.


 That is the theory.  In practice write caches are so large now, there is
 almost no pressure forcing writes to happen until the fsync calls show up.
  It's easily possible to enter the checkpoint fsync phase only to discover
 there are 4GB of dirty writes ahead of you, ones that have nothing to do
 with the checkpoint's I/O.

 Backends are constantly pounding the write cache with new writes in
 situations with checkpoint spikes.  The writes and fsync calls made by the
 checkpoint process are only a fraction of the real I/O going on. The volume
 of data being squeezed out by each fsync call is based on total writes to
 that relation since the checkpoint.  That's connected to the writes to that
 relation happening during the checkpoint, but the checkpoint writes can
 easily be the minority there.

 It is not a coincidence that the next feature I'm working on attempts to
 quantify the total writes to each 1GB relation chunk.  That's the most
 promising path forward on the checkpoint problem I've found.


 --
 Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
 PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com



Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-21 Thread didier
Hi,

On Sat, Jul 20, 2013 at 6:28 PM, Greg Smith g...@2ndquadrant.com wrote:

 On 7/20/13 4:48 AM, didier wrote:


 That is the theory.  In practice write caches are so large now, there is
 almost no pressure forcing writes to happen until the fsync calls show up.
  It's easily possible to enter the checkpoint fsync phase only to discover
 there are 4GB of dirty writes ahead of you, ones that have nothing to do
 with the checkpoint's I/O.

 Isn't adding another layer of cache the usual answer?

The best place would be in the OS: a filesystem with a big journal, able to
write a lot of blocks sequentially.

If not, and if you can spare at worst 2 bits in memory per data block, don't
mind preallocated data files (assuming metadata is stable then), and have
working mmap(MAP_NONBLOCK) and mincore() syscalls, you could have a
checkpoint in bounded time: worst case you sequentially write the whole
server RAM to a separate disk every checkpoint.
Not sure I would trust such a beast with my data though :)


Didier


[HACKERS] small typo in src/backend/access/transam/xlog.c

2013-07-22 Thread didier
Hi

in void BootStrapXLOG(void):

	 * to seed it other than the system clock value...)  The upper half of the
	 * uint64 value is just the tv_sec part, while the lower half is the XOR
	 * of tv_sec and tv_usec.  This is to ensure that we don't lose uniqueness
	 * unnecessarily if uint64 is really only 32 bits wide.  A person
	 * knowing this encoding can determine the initialization time of the
	 * installation, which could perhaps be useful sometimes.
	 */
	gettimeofday(&tv, NULL);
	sysidentifier = ((uint64) tv.tv_sec) << 32;
	sysidentifier |= (uint32) (tv.tv_sec | tv.tv_usec);

should be

	sysidentifier |= (uint32) (tv.tv_sec ^ tv.tv_usec);

Regards
Didier


Re: [HACKERS] Design proposal: fsync absorb linear slider

2013-07-25 Thread didier
Hi


On Tue, Jul 23, 2013 at 5:48 AM, Greg Smith g...@2ndquadrant.com wrote:

 Recently I've been dismissing a lot of suggested changes to checkpoint
 fsync timing without suggesting an alternative.  I have a simple one in
 mind that captures the biggest problem I see:  that the number of backend
 and checkpoint writes to a file are not connected at all.

 We know that a 1GB relation segment can take a really long time to write
 out.  That could include up to 128 changed 8K pages, and we allow all of
 them to get dirty before any are forced to disk with fsync.

It was surely already discussed, but why isn't PostgreSQL writing its cache
sequentially to a temporary file? With storage random-write speed at least
five to ten times slower than sequential, it could help a lot.
Thanks

Didier


Re: [HACKERS] Design proposal: fsync absorb linear slider

2013-07-25 Thread didier
Hi,


 Sure, that's what the WAL does.  But you still have to checkpoint
 eventually.

Sure, when you run pg_ctl stop.
Unlike the WAL it only needs two files, shared_buffers in size.

I did some bogus tests by replacing mask |= BM_PERMANENT; with mask = -1 in
BufferSync(), and simulating the checkpoint with a periodic dd if=/dev/zero
of=foo conv=fsync.

On a saturated storage with %usage locked solid at 100% I got up to 30%
speed improvement and fsync latency down by one order of magnitude; some
fsyncs were still slow, of course, if buffers were already in the OS cache.

But that's the upper bound: it was done on a slow storage with bad ratios,
(OS cache write)/(disk sequential write) around 50 and (sequential
write)/(effective random write) around 10, and a proper implementation
would have a 'little' more work to do... (only the checkpoint task can write
BM_CHECKPOINT_NEEDED buffers, keeping them dirty, and so on)

Didier


Re: [HACKERS] Design proposal: fsync absorb linear slider

2013-07-26 Thread didier
Hi,


On Fri, Jul 26, 2013 at 11:42 AM, Greg Smith g...@2ndquadrant.com wrote:

 On 7/25/13 6:02 PM, didier wrote:

 It was surely already discussed but why isn't postresql  writing
 sequentially its cache in a temporary file?


 If you do that, reads of the data will have to traverse that temporary
 file to assemble their data.  You'll make every later reader pay the random
 I/O penalty that's being avoided right now.  Checkpoints are already
 postponing these random writes as long as possible. You have to take care
 of them eventually though.


No, the log file is only used at recovery time.

In the checkpoint code:
- loop over the cache and mark dirty buffers with BM_CHECKPOINT_NEEDED, as
in the current code
- other workers can't write and evict these marked buffers to disk (there's
a race with fsync)
- the checkpoint fsyncs now, or after the next step
- the checkpoint loops again and saves these buffers to the log, clearing
BM_CHECKPOINT_NEEDED but *not* clearing BM_DIRTY; of course many buffers
will be written again later, as they are when a checkpoint isn't running
- checkpoint done.

During recovery you have to load the log into the cache first, before
applying WAL.

Didier


Re: [HACKERS] Design proposal: fsync absorb linear slider

2013-07-26 Thread didier
Hi,


On Fri, Jul 26, 2013 at 3:41 PM, Greg Smith g...@2ndquadrant.com wrote:

 On 7/26/13 9:14 AM, didier wrote:

 During recovery you have to load the log in cache first before applying
 WAL.


 Checkpoints exist to bound recovery time after a crash.  That is their
 only purpose.  What you're suggesting moves a lot of work into the recovery
 path, which will slow down how long it takes to process.

Yes, it's slower, but you're sequentially reading only one file, at most the
size of your buffer cache; moreover it takes constant time.

Let say you make a checkpoint and crash just after with a next to empty
WAL.

Now recovery  is very fast but you have to repopulate your cache with
random reads from requests.

With the snapshot it's slower but you read, sequentially again, a lot of
hot cache you will need later when the db starts serving requests.

Of course the worst case is if it crashes just before a checkpoint: most of
the snapshot data is stale and will be overwritten by WAL ops.

But if the WAL recovery is CPU bound, loading from the snapshot may be
done concurrently while replaying the WAL.

 More work at recovery time means someone who uses the default of
 checkpoint_timeout='5 minutes', expecting that crash recovery won't take
 very long, will discover it does take a longer time now.  They'll be forced
 to shrink the value to get the same recovery time as they do currently.
  You might need to make checkpoint_timeout 3 minutes instead, if crash
 recovery now has all this extra work to deal with.  And when the time
 between checkpoints drops, it will slow the fundamental efficiency of
 checkpoint processing down.  You will end up writing out more data in the
 end.

Yes, it's a trade-off: now you're paying the price at checkpoint time, every
time; with the log you're paying only once, at recovery.


 The interval between checkpoints and recovery time are all related.  If
 you let any one side of the current requirements slip, it makes the rest
 easier to deal with.  Those are all trade-offs though, not improvements.
  And this particular one is already an option.

 If you want less checkpoint I/O per capita and don't care about recovery
 time, you don't need a code change to get it.  Just make checkpoint_timeout
 huge.  A lot of checkpoint I/O issues go away if you only do a checkpoint
 per hour, because instead of random writes you're getting sequential ones
 to the WAL.  But when you crash, expect to be down for a significant chunk
 of an hour, as you go back to sort out all of the work postponed before.

It's not the same: it's a snapshot, saved and loaded in constant time,
unlike the WAL log.

Didier


Re: [HACKERS] Properly initialize negative/empty cache entries in relfilenodemap

2013-08-29 Thread didier
Hi,

On Thu, Aug 29, 2013 at 2:35 PM, MauMau maumau...@gmail.com wrote:


 Great!  Could anybody find the root cause for the following memory leak
 problem, and if possible, fix this?

 http://www.postgresql.org/message-id/214653D8DF574BFEAA6ED53E545E99E4@maumau

 Heikki helped to solve this and found that pg_statistic entries are left in
 CacheMemoryContext, but we have no idea where and how they are created and
 left.  This seems difficult to me.

Valgrind won't help you for this one.
You hit 2 issues:
- users can create negative cache entries in pg_statistic with SELECT, but
they are unbounded (at first there was an LRU aging, but it was removed in
2006)

- if there's no row in pg_statistic for a relation/column, then
RemoveStatistics, called by DROP ..., doesn't invalidate the cache (which
should remove these negative entries).


Re: [HACKERS] Freezing without write I/O

2013-09-20 Thread didier
Hi


On Fri, Sep 20, 2013 at 5:11 PM, Andres Freund and...@2ndquadrant.com wrote:

 On 2013-09-20 16:47:24 +0200, Andres Freund wrote:
  I think we should go through the various implementations and make sure
  they are actual compiler barriers and then change the documented policy.

 From a quick look
 * S_UNLOCK for PPC isn't a compiler barrier
 * S_UNLOCK for MIPS isn't a compiler barrier
 * I don't know enough about unixware (do we still support that as a
 platform even) to judge
 * True64 Alpha I have no clue about
 * PA-RISCs tas() might not be a compiler barrier for !GCC
 * PA-RISCs S_UNLOCK might not be a compiler barrier
 * HP-UX !GCC might not
 * IRIX 5 seems to be a compiler barrier
 * SINIX - I don't care
 * AIX PPC - compiler barrier
 * Sun - TAS is implemented in external assembly, normal function call,
   compiler barrier
 * Win(32|64) - compiler barrier
 * Generic S_UNLOCK *NOT* necessarily a compiler barrier.

 Ok, so I might have been a bit too optimistic...

 Greetings,

 Andres Freund

 --
  Andres Freund http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


 --
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers



Re: [HACKERS] Freezing without write I/O

2013-09-20 Thread didier
Hi,

IMO it's a bug if S_UNLOCK is not a compiler barrier.

Moreover for volatile remember:
https://www.securecoding.cert.org/confluence/display/seccode/DCL17-C.+Beware+of+miscompiled+volatile-qualified+variables

Who is double checking compiler output? :)

regards
Didier





[HACKERS] trivial one-off memory leak in guc-file.l ParseConfigFile

2013-09-22 Thread didier
Hi

Fix a small memory leak in guc-file.l ParseConfigFile.

AbsoluteConfigLocation() returns a strdup'd string, but it's never freed or
referenced outside ParseConfigFile.

Courtesy of Valgrind and Noah Misch's MEMPOOL work.

Regards
Didier


memory_leak_in_parse_config_file.patch
Description: Binary data



Re: [HACKERS] OSX doesn't accept identical source/target for strcpy() anymore

2013-10-28 Thread didier
Hi,


On Mon, Oct 28, 2013 at 7:11 PM, Tom Lane t...@sss.pgh.pa.us wrote:


 If copying takes place between objects that overlap, the behavior is
 undefined.

 Both gcc and glibc have been moving steadily in the direction of
 aggressively exploiting undefined behavior cases for optimization
 purposes.  I don't know if there is yet a platform where strncpy with
 src == dest behaves oddly, but we'd be foolish to imagine that it's
 not going to happen eventually.  If anything, Apple is probably doing
 us a service by making it obvious where we're failing to adhere to spec.

 However ... I still can't replicate this here, and as you say, there's
 about zero chance of keeping our code clean of this problem unless we
 can set up a buildfarm member that will catch it.

 regards, tom lane


I don't have a 10.9 box for double-checking, but there's a gcc command line
triggering the same assert for strcpy() at
http://lists.gnu.org/archive/html/bug-bash/2013-07/msg00011.html.

Didier


Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-09-05 Thread didier
hi

On Thu, Sep 4, 2014 at 7:01 PM, Robert Haas robertmh...@gmail.com wrote:
 On Thu, Sep 4, 2014 at 3:09 AM, Ants Aasma a...@cybertec.at wrote:
 On Thu, Sep 4, 2014 at 12:36 AM, Andres Freund and...@2ndquadrant.com 
 wrote:
 It's imo quite clearly better to keep it allocated. For one after
 postmaster started the checkpointer successfully you don't need to be
 worried about later failures to allocate memory if you allocate it once
 (unless the checkpointer FATALs out which should be exceedingly rare -
 we're catching ERRORs). It's much much more likely to succeed
 initially. Secondly it's not like there's really that much time where no
 checkpointer isn't running.

 In principle you could do the sort with the full sized array and then
 compress it to a list of buffer IDs that need to be written out. This
 way most of the time you only need a small array and the large array
 is only needed for a fraction of a second.

 It's not the size of the array that's the problem; it's the size of
 the detonation when the allocation fails.

You could use a file-backed memory array.

Or, because it's only a hint, and:
- the keys are in the buffers (BufferTag), right?
- the only transition is from 'data I care about' to 'data I don't care
about', if a buffer is concurrently evicted while sorting,

you could use a preallocated buffer-index array and read the keys from the
buffers while sorting, without memory barriers, spinlocks, or whatever.




Re: [HACKERS] posix_fadvise() and pg_receivexlog

2014-09-09 Thread didier
Hi

 Well, I'd like to hear someone from the field complaining that
 pg_receivexlog is thrashing the cache and thus reducing the performance of
 some other process. Or a least a synthetic test case that demonstrates that
 happening.
It's not with pg_receivexlog, but it's related.

On a small box without a replication server connected, perfs were good
enough, but not so with a replication server connected: there was 1GB
worth of WAL sitting in RAM vs next to nothing without the slave!
Setup:
8GB RAM
2GB shared_buffers (smaller has other issues)
checkpoint_segments 40 (a smaller value triggers too many xlog checkpoints)
checkpoints spread over 10 min, writing 30 to 50% of shared buffers
live data set fits in RAM
constant load.

On startup (1 or 2 per hour) applications were running requests on cold
data, which were now saturating IO.
I'm not sure it's an OS bug, as the WAL was 'hotter' than the cold data.

A cron task running vmtouch -e every minute to evict old WAL files
from memory has solved the issue.

Regards




[HACKERS] proposal: adding a GUC for BAS_BULKREAD strategy

2014-09-23 Thread didier
Hi,

Currently the value is hard-coded to NBuffers / 4, but ISTM that with
bigger shared_buffers it's too much, i.e. even with a DB 10 to 20 times
the memory size there are a lot of tables under this limit, and nightly
batch reports are thrashing the shared buffers cache as if there's no
tomorrow.


regards,




Re: [HACKERS] Failback to old master

2014-11-16 Thread didier
Hi,


On Sat, Nov 15, 2014 at 5:31 PM, Maeldron T. maeld...@gmail.com wrote:
 A safely shut down master (-m fast is safe) can be safely restarted as
 a slave to the newly promoted master. Fast shutdown shuts down all
 normal connections, does a shutdown checkpoint and then waits for this
 checkpoint to be replicated to all active streaming clients. Promoting
 slave to master creates a timeline switch, that prior to version 9.3
 was only possible to replicate using the archive mechanism. As of
 version 9.3 you don't need to configure archiving to follow timeline
 switches, just add a recovery.conf to the old master to start it up as
 a slave and it will fetch everything it needs from the new master.

 I took your advice and I understood that removing the recovery.conf followed
 by a restart is wrong. I will not do that on my production servers.

 However, I can't make it work with promotion. What did I wrong? It was
 9.4beta3.

 mkdir 1
 mkdir 2
 initdb -D 1/
 edit config: change port, wal_level to hot_standby, hot_standby to on,
 max_wal_senders=7, wal_keep_segments=100, uncomment replication in hba.conf
 pg_ctl -D 1/ start
 createdb -p 5433
 psql -p 5433
 pg_basebackup -p 5433 -R -D 2/
 mcedit 2/postgresql.conf change port
 chmod -R 700 1
 chmod -R 700 2
 pg_ctl -D 2/ start
 psql -p 5433
 psql -p 5434
 everything works
 pg_ctl -D 1/ stop
 pg_ctl -D 2/ promote
 psql -p 5434
 cp 2/recovery.done 1/recovery.conf
 mcedit 1/recovery.conf change port
 pg_ctl -D 1/ start

 LOG:  replication terminated by primary server
 DETAIL:  End of WAL reached on timeline 1 at 0/3000AE0.
 LOG:  restarted WAL streaming at 0/300 on timeline 1
 LOG:  replication terminated by primary server
 DETAIL:  End of WAL reached on timeline 1 at 0/3000AE0.

 This is what I experienced in the past when I tried with promote. The old
 master disconnects from the new. What am I missing?

I think you have to add

recovery_target_timeline = '2'

to recovery.conf, with '2' being the new primary's timeline.
cf. http://www.postgresql.org/docs/9.4/static/recovery-target-settings.html

Didier




Re: [HACKERS] WALWriter active during recovery

2014-12-17 Thread didier
Hi,

On Tue, Dec 16, 2014 at 6:07 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On 16 December 2014 at 14:12, Heikki Linnakangas
 hlinnakan...@vmware.com wrote:
 On 12/15/2014 08:51 PM, Simon Riggs wrote:

 Currently, WALReceiver writes and fsyncs data it receives. Clearly,
 while we are waiting for an fsync we aren't doing any other useful
 work.

 Following patch starts WALWriter during recovery and makes it
 responsible for fsyncing data, allowing WALReceiver to progress other
 useful actions.
On many Linux systems it may not help that much (2.6.32 and 3.2 are bad;
3.13 is better, but it still slows the fsync down).

If there's an fsync in progress, WALReceiver will:
1- slow the fsync down, because its writes to the same file are grabbed by
the fsync
2- stall until the end of the fsync.

From 'stracing' a test program simulating this pattern:
two processes, one writing to a file while the second fsyncs it.

20279 11:51:24.037108 fsync(5 unfinished ...
20278 11:51:24.053524 ... nanosleep resumed NULL) = 0 0.020281
20278 11:51:24.053691 lseek(3, 1383612416, SEEK_SET) = 1383612416 0.000119
20278 11:51:24.053965 write(3, ...,
8192) = 8192 0.000111
20278 11:51:24.054190 nanosleep({0, 2000}, NULL) = 0 0.020243

20278 11:51:24.404386 lseek(3, 194772992, SEEK_SET unfinished ...
20279 11:51:24.754123 ... fsync resumed ) = 0 0.716971
20279 11:51:24.754202 close(5 unfinished ...
20278 11:51:24.754232 ... lseek resumed ) = 194772992 0.349825

Yes that's a 300ms lseek...



 What other useful actions can WAL receiver do while it's waiting? It doesn't
 do much else than receive WAL, and fsync it to disk.

 So now it will only need to do one of those two things.


Regards
Didier




Re: [HACKERS] WALWriter active during recovery

2014-12-17 Thread didier
Hi

On Wed, Dec 17, 2014 at 2:39 PM, Alvaro Herrera
alvhe...@2ndquadrant.com wrote:
 didier wrote:

 On many Linux systems it may not do that much (2.6.32 and 3.2 are bad,
 3.13 is better but still it slows the fsync).

 If there's a fsync in progress WALReceiver will:
 1- slow the fsync because its writes to the same file are grabbed by the 
 fsync
 2- stall until the end of fsync.

 Is this behavior filesystem-dependent?
I don't know. I only tested  ext4

Attach the trivial code I used, there's a lot of junk in it.

Didier
/*
 * Compile with: gcc testf.c -Wall -W -O0
 */

#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/fcntl.h>
#include <sys/time.h>
#include <stdlib.h>
#include <stdint.h>
#include <sys/file.h>
#include <time.h>
#include <errno.h>

static long long microseconds(void) {
   struct timeval tv;
   long long mst;

   gettimeofday(&tv, NULL);
   mst = ((long long)tv.tv_sec)*1000000;
   mst += tv.tv_usec;
   return mst;
}
int out = 0;
//#define FLOCK(a,b) flock(a,b)
#define FLOCK(a,b) (0)

//==
// fsync process
void child(void) {
  int fd, retval;
  long long start = 0;

  while(1) {
     fd = open("/tmp/foo.txt", O_RDONLY);
     //usleep(3000);
     usleep(500);
     FLOCK(fd, LOCK_EX);
     if (out) {
       printf("Start sync\n");
       fflush(stdout);
       start = microseconds();
     }
     retval = fsync(fd);
     FLOCK(fd, LOCK_UN);
     if (out) {
       printf("Sync in %lld microseconds (%d)\n", microseconds()-start, retval);
       fflush(stdout);
     }
     close(fd);
   }
   exit(0);
}

char buf[8*1024];
#define f_size (2lu*1024*1024*1024)

//==
// random reads
void child2(void) {
   int fd;
   long long start = 0;
   off_t lfsr;

   fd = open("/tmp/foo.txt", O_RDWR /*|O_CREAT | O_SYNC*/, 0644);
   srandom(2000 + time(NULL));
   while(1) {
      if (out) {
        start = microseconds();
      }
      lfsr = random()/sizeof(buf);
      if (pread(fd, buf, sizeof(buf), sizeof(buf)*lfsr) == -1) {
         perror("read");
         exit(1);
      }
      // posix_fadvise(fd, sizeof(buf)*lfsr, sizeof(buf), POSIX_FADV_DONTNEED);
      if (out) {
        printf("read %lu in %lld microseconds\n", lfsr * sizeof(buf), microseconds()-start);
        fflush(stdout);
      }
      usleep(500);
   }
   close(fd);
   exit(0);
}

//==
// sequential writes
void child3(int end) {
   int fd;
   int i;
   int j = 2;

   fd = open("/tmp/foo.txt", O_RDWR /*|O_CREAT | O_SYNC*/, 0644);
   for (i = 0; i < 131072/j; i++) {
      lseek(fd, sizeof(buf)*(i*j), SEEK_SET);
      write(fd, buf, sizeof(buf));
   }
   close(fd);
   if (end)
     exit(0);
   sleep(60);
}


int main(void) {
   int fd0 = open("/tmp/foo.txt", O_RDWR | O_CREAT /*| O_SYNC*/, 0644);
   int fd1 = open("/tmp/foo1.txt", O_RDWR | O_CREAT /*| O_SYNC*/, 0644);

   int fd;
   long long start = 0;
   off_t lfsr = 0;
   memset(buf, 'a', sizeof(buf));
   ftruncate(fd0, f_size);
   ftruncate(fd1, f_size);
   printf("%d\n", RAND_MAX);
//   child3(0);

   if (!fork()) {
     child();
     exit(1);
   }

#if 0
   if (!fork()) {
     child2();
     exit(1);
   }
   if (!fork()) {
     child3(1);
     exit(1);
   }
#endif
   srandom(1000 + time(NULL));
   while(1) {
      fd = fd0;
      if (FLOCK(fd, LOCK_EX | LOCK_NB) == -1) {
         if (errno == EWOULDBLOCK)
            fd = fd1;
      }
      lfsr = random()/sizeof(buf);
      if (out) {
         start = microseconds();
      }
//      if (pwrite(fd, buf, sizeof(buf), sizeof(buf)*lfsr) == -1) {
      lseek(fd, sizeof(buf)*lfsr, SEEK_SET);
      if (write(fd, buf, sizeof(buf)) == -1) {
         perror("write");
         exit(1);
      }
      if (out) {
        printf("Write %lu in %lld microseconds\n", lfsr * sizeof(buf), microseconds()-start);
        fflush(stdout);
      }
      if (fd == fd0) {
         FLOCK(fd, LOCK_UN);
      }
      usleep(2);
   }
   close(fd);
   exit(0);
}



Re: [HACKERS] PATCH: pgbench - merging transaction logs

2015-03-21 Thread didier
Hi,

On Sat, Mar 21, 2015 at 10:37 AM, Fabien COELHO coe...@cri.ensmp.fr wrote:

   no logging: 18672 18792 18667 18518 18613 18547
 with logging: 18170 18093 18162 18273 18307 18234

 So on average, that's 18634 vs. 18206, i.e. less than 2.5% difference.
 And with more expensive transactions (larger scale, writes, ...) the
 difference will be much smaller.


 Ok. Great!

 Let us take this as a worst-case figure and try some maths.

 If fprintf takes p = 0.025 (1/40) of the time, then with 2 threads the
 collision probability would be about 1/40 and the delayed thread would be
 waiting for half this time on average, so the performance impact due to
 fprintf locking would be negligeable (1/80 delay occured in 1/40 cases =
 1/3200 time added on the computed average, if I'm not mistaken).
If threads run more or less the same code with the same timing, after a
while they will lockstep on synchronization primitives, and your
collision probability will be very close to 1.

Moreover, they will write to the same cache lines on every fprintf,
and this is very, very bad even without atomic operations.

Regards
Didier




Re: [HACKERS] PATCH: pgbench - merging transaction logs

2015-03-23 Thread didier
Hi,

On Sat, Mar 21, 2015 at 8:42 PM, Fabien COELHO coe...@cri.ensmp.fr wrote:

 Hello Didier,

 If fprintf takes p = 0.025 (1/40) of the time, then with 2 threads the
 collision probability would be about 1/40 and the delayed thread would be
 waiting for half this time on average, so the performance impact due to
 fprintf locking would be negligeable (1/80 delay occured in 1/40 cases =
 1/3200 time added on the computed average, if I'm not mistaken).
Yes, but for a third thread (each on a physical core) it will be 1/40 +
1/40, and so on, up to roughly 40/40 for 40 cores.



 If  threads run more or less the same code with the same timing after
 a while they will lockstep  on synchronization primitives and your
 collision probability will be very close to 1.


 I'm not sure I understand. If transaction times were really constant, then
 after a while the mutexes would be synchronised so as to avoid contention,
 i.e. the collision probability would be 0?
But they aren't constant, only close. It may or may not show up in this
case, but I've noticed that the collision rate is often a lot higher
than the probability would suggest; I'm not sure why.


 Moreover  they will write to the same cache lines for every fprintf
 and this is very very bad even without atomic operations.


 We're talking of transactions that involve network messages and possibly
 disk IOs on the server, so some cache issues issues within pgbench would not
 be a priori the main performance driver.
Sure, but:
- good measurement is hard, and adding locking in fprintf makes its
timing noisier.

- it's against 'good practices' for scalable code. Trivial code can
show that the elapsed time for as few as four cores writing to the same
cache line in a loop, without locking or synchronization, is greater than
the elapsed time of running those four loops sequentially on one
core. If they write to different cache lines it scales linearly.

Regards
Didier




Re: [HACKERS] Foreign key wierdness

2003-01-23 Thread Didier Moens
Dear Tom, Dave,


Tom Lane wrote:


Ah-hah, and I'll bet that the column being linked to this one by the
foreign key constraint is still an integer?



It sure is; being a PostgreSQL novice (BTW: many thanks to the whole 
of the PG development team for such an excellent product), I got on this 
track by means of 
http://archives.postgresql.org/pgsql-sql/2001-05/msg00395.php .


With two tables each containing some 20.000 entries, the fk creation 
time between both of them increases from ~ 1.8 secs to ~ 221 secs.
 


Seems odd that the cost would get *that* much worse.  Maybe we need to
look at whether the FK checking queries need to include explicit casts
...


Well, I reproduced the slowdown with some 20 to 30 different tables.
Anyway, glad I could be of some help, albeit only by testing some 
(probably quite meaningless) border cases ...  :)


Regards,
Didier

--

Didier Moens
-
RUG/VIB - Dept. Molecular Biomedical Research - Core IT
tel ++32(9)2645309 fax ++32(9)2645348
http://www.dmb.rug.ac.be



---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [HACKERS] Foreign key wierdness

2003-01-23 Thread Didier Moens
Hi all,

Dave Page wrote:


If you really think the schema qualification has something to 
do with it, try issuing the ADD FOREIGN KEY command manually 
in psql, with and without schema name.
   


Well to be honest I'm having a hard time believing it, but having looked
at this in some depth, it's the only thing that the 2 versions of
pgAdmin are doing differently. Even the PostgreSQL logs agree with that.
I'm relying on Didier for test results though as I don't have a test
system I can use for this at the moment.

But it gives us something to try - Didier can you create a new database
please, and load the data from 2 tables. VACUUM ANALYZE, then add the
foreign key in psql using the syntax 1.4.2 uses. Then drop the database,
and load exactly the same data in the same way, VACUUM ANALYZE again,
and create the fkey using the qualified tablename syntax.



I did some extensive testing using PostgreSQL 7.3.1 (logs and results 
available upon request), and the massive slowdown is NOT related to 
qualified tablename syntax or (lack of) VACUUM ANALYZE, but to the 
following change :

pgAdminII 1.4.2:
---
CREATE TABLE articles (
    article_id integer DEFAULT nextval('articles_article_id_key'::text) NOT NULL,
...

test=# \d articles
                                Table "public.articles"
   Column   |  Type   |                         Modifiers
------------+---------+-----------------------------------------------------------
 article_id | integer | not null default nextval('articles_article_id_key'::text)
...

pgAdminII 1.4.12:

CREATE TABLE articles (
    article_id bigint DEFAULT nextval('articles_article_id_key'::text) NOT NULL,
...

test=# \d articles
                                Table "public.articles"
   Column   |  Type  |                         Modifiers
------------+--------+-----------------------------------------------------------
 article_id | bigint | not null default nextval('articles_article_id_key'::text)
...


With two tables each containing some 20.000 entries, the fk creation 
time between both of them increases from ~ 1.8 secs to ~ 221 secs.


Regards,
Didier

--

Didier Moens
-
RUG/VIB - Dept. Molecular Biomedical Research - Core IT
tel ++32(9)2645309 fax ++32(9)2645348
http://www.dmb.rug.ac.be



---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [HACKERS] Foreign key wierdness

2003-01-23 Thread Didier Moens
Dave Page wrote:


From what Tom has said in his response, I think the answer for you Didier
is to remap your integer columns to int8 instead of int4 and see what
happens. When I get a couple of minutes I will look at putting a Serials
as... Option in the type map.



Thanks Dave, for all of your invested time.

I think the value of tools such as pgAdmin, which provide an almost 
bumpless cross-platform migration path, cannot be overstated.


Regards,
Didier

--

Didier Moens
-
RUG/VIB - Dept. Molecular Biomedical Research - Core IT
http://www.dmb.rug.ac.be



---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]