Re: [HACKERS] Design proposal: fsync absorb linear slider

2013-08-27 Thread Greg Smith
On 7/29/13 2:04 AM, KONDO Mitsumasa wrote: I think that it is almost the same as a small dirty_background_ratio or dirty_background_bytes. The main difference here is that all writes pushed out this way will be to a single 1GB relation chunk. The odds are better that multiple writes will combine,
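For readers who don't have those kernel knobs memorized, these are the Linux VM writeback settings the proposal is being compared to; the values below are purely illustrative, not numbers suggested in the thread:

    # /etc/sysctl.conf -- Linux VM writeback tuning (illustrative values only)
    vm.dirty_background_bytes = 67108864    # kernel starts background writeback once 64MB is dirty
    vm.dirty_bytes = 268435456              # writers get throttled once 256MB is dirty
    # (dirty_background_ratio / dirty_ratio are the equivalent percentage-based knobs)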

Re: [HACKERS] Design proposal: fsync absorb linear slider

2013-08-22 Thread Jim Nasby
On 7/26/13 7:32 AM, Tom Lane wrote: Greg Smith g...@2ndquadrant.com writes: On 7/26/13 5:59 AM, Hannu Krosing wrote: Well, SSD disks do it in the way proposed by didier (AFAIK), by putting random fs pages on one large disk page and having an extra index layer for resolving random-to-sequential

Re: [HACKERS] Design proposal: fsync absorb linear slider

2013-07-29 Thread KONDO Mitsumasa
(2013/07/24 1:13), Greg Smith wrote: On 7/23/13 10:56 AM, Robert Haas wrote: On Mon, Jul 22, 2013 at 11:48 PM, Greg Smith g...@2ndquadrant.com wrote: We know that a 1GB relation segment can take a really long time to write out. That could include up to 128K changed 8K pages, and we allow all
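(For reference, the 128K figure is simply segment size divided by page size: 1 GiB / 8 KiB = 131,072 pages, i.e. 128K pages in a fully dirty 1GB segment.)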

Re: [HACKERS] Design proposal: fsync absorb linear slider

2013-07-26 Thread Greg Smith
On 7/25/13 6:02 PM, didier wrote: It was surely already discussed, but why isn't PostgreSQL writing its cache sequentially to a temporary file? If you do that, reads of the data will have to traverse that temporary file to assemble their data. You'll make every later reader pay the random

Re: [HACKERS] Design proposal: fsync absorb linear slider

2013-07-26 Thread Hannu Krosing
On 07/26/2013 11:42 AM, Greg Smith wrote: On 7/25/13 6:02 PM, didier wrote: It was surely already discussed, but why isn't PostgreSQL writing its cache sequentially to a temporary file? If you do that, reads of the data will have to traverse that temporary file to assemble their data. You'll

Re: [HACKERS] Design proposal: fsync absorb linear slider

2013-07-26 Thread Hannu Krosing
On 07/26/2013 11:42 AM, Greg Smith wrote: On 7/25/13 6:02 PM, didier wrote: It was surely already discussed, but why isn't PostgreSQL writing its cache sequentially to a temporary file? If you do that, reads of the data will have to traverse that temporary file to assemble their data. In

Re: [HACKERS] Design proposal: fsync absorb linear slider

2013-07-26 Thread Greg Smith
On 7/26/13 5:59 AM, Hannu Krosing wrote: Well, SSD disks do it in the way proposed by didier (AFAIK), by putting random fs pages on one large disk page and having an extra index layer for resolving random-to-sequential ordering. If your solution to avoiding random writes now is to do

Re: [HACKERS] Design proposal: fsync absorb linear slider

2013-07-26 Thread Tom Lane
Greg Smith g...@2ndquadrant.com writes: On 7/26/13 5:59 AM, Hannu Krosing wrote: Well, SSD disks do it in the way proposed by didier (AFAIK), by putting random fs pages on one large disk page and having an extra index layer for resolving random-to-sequential ordering. If your solution to

Re: [HACKERS] Design proposal: fsync absorb linear slider

2013-07-26 Thread didier
Hi, On Fri, Jul 26, 2013 at 11:42 AM, Greg Smith g...@2ndquadrant.com wrote: On 7/25/13 6:02 PM, didier wrote: It was surely already discussed, but why isn't PostgreSQL writing its cache sequentially to a temporary file? If you do that, reads of the data will have to traverse that

Re: [HACKERS] Design proposal: fsync absorb linear slider

2013-07-26 Thread Greg Smith
On 7/26/13 9:14 AM, didier wrote: During recovery you have to load the log into cache first before applying the WAL. Checkpoints exist to bound recovery time after a crash. That is their only purpose. What you're suggesting moves a lot of work into the recovery path, which will slow down how
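For context, the knobs that bound that recovery interval in 9.3-era PostgreSQL are the checkpoint settings; the values below are illustrative, not ones proposed in the thread:

    # postgresql.conf (9.3-era checkpoint settings, illustrative values only)
    checkpoint_segments = 32               # WAL volume allowed between checkpoints
    checkpoint_timeout = 5min              # maximum time between checkpoints
    checkpoint_completion_target = 0.9     # fraction of the interval over which checkpoint writes are spread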

Re: [HACKERS] Design proposal: fsync absorb linear slider

2013-07-26 Thread Greg Smith
On 7/26/13 8:32 AM, Tom Lane wrote: What I'd point out is that that is exactly what WAL does for us, i.e. convert a bunch of random writes into sequential writes. But sooner or later you have to put the data where it belongs. Hannu was observing that SSDs often don't do that at all. They can

Re: [HACKERS] Design proposal: fsync absorb linear slider

2013-07-26 Thread didier
Hi, On Fri, Jul 26, 2013 at 3:41 PM, Greg Smith g...@2ndquadrant.com wrote: On 7/26/13 9:14 AM, didier wrote: During recovery you have to load the log into cache first before applying the WAL. Checkpoints exist to bound recovery time after a crash.

Re: [HACKERS] Design proposal: fsync absorb linear slider

2013-07-25 Thread didier
Hi On Tue, Jul 23, 2013 at 5:48 AM, Greg Smith g...@2ndquadrant.com wrote: Recently I've been dismissing a lot of suggested changes to checkpoint fsync timing without suggesting an alternative. I have a simple one in mind that captures the biggest problem I see: that the number of backend

Re: [HACKERS] Design proposal: fsync absorb linear slider

2013-07-25 Thread Robert Haas
On Thu, Jul 25, 2013 at 6:02 PM, didier did...@gmail.com wrote: It was surely already discussed, but why isn't PostgreSQL writing its cache sequentially to a temporary file? With random storage speed at least five to ten times slower than sequential, it could help a lot. Thanks Sure, that's what the WAL does.

Re: [HACKERS] Design proposal: fsync absorb linear slider

2013-07-25 Thread didier
Hi, Sure, that's what the WAL does. But you still have to checkpoint eventually. Sure, when you run pg_ctl stop. Unlike the WAL, it only needs two files, shared_buffers in size. I did bogus tests by replacing mask |= BM_PERMANENT; with mask = -1 in BufferSync() and simulating a checkpoint with
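For readers without the source handy, here is a rough sketch of the buffer-selection logic that experiment touches, paraphrased from src/backend/storage/buffer/bufmgr.c of that era; it is trimmed and approximate, not a verbatim copy:

    static void
    BufferSync(int flags)
    {
        int     buf_id;
        int     mask = BM_DIRTY;

        /*
         * Normally only dirty buffers of permanent relations are written at
         * checkpoint time; shutdown and end-of-recovery checkpoints write
         * everything.  This is the line the test above replaced with
         * "mask = -1".
         */
        if (!(flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY)))
            mask |= BM_PERMANENT;

        /* Flag every buffer whose header bits match the mask for writing. */
        for (buf_id = 0; buf_id < NBuffers; buf_id++)
        {
            volatile BufferDesc *bufHdr = &BufferDescriptors[buf_id];

            LockBufHdr(bufHdr);
            if ((bufHdr->flags & mask) == mask)
                bufHdr->flags |= BM_CHECKPOINT_NEEDED;
            UnlockBufHdr(bufHdr);
        }
        /* ... the real function then writes out and fsyncs the flagged buffers ... */
    }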

Re: [HACKERS] Design proposal: fsync absorb linear slider

2013-07-24 Thread Robert Haas
On Tue, Jul 23, 2013 at 12:13 PM, Greg Smith g...@2ndquadrant.com wrote: On 7/23/13 10:56 AM, Robert Haas wrote: On Mon, Jul 22, 2013 at 11:48 PM, Greg Smith g...@2ndquadrant.com wrote: We know that a 1GB relation segment can take a really long time to write out. That could include up to 128

Re: [HACKERS] Design proposal: fsync absorb linear slider

2013-07-23 Thread Peter Geoghegan
On Mon, Jul 22, 2013 at 8:48 PM, Greg Smith g...@2ndquadrant.com wrote: And I can't get too excited about making this my volunteer effort when I consider what the resulting credit will look like. Coding is by far the smallest part of work like this, first behind coming up with the design in

Re: [HACKERS] Design proposal: fsync absorb linear slider

2013-07-23 Thread Robert Haas
On Mon, Jul 22, 2013 at 11:48 PM, Greg Smith g...@2ndquadrant.com wrote: Recently I've been dismissing a lot of suggested changes to checkpoint fsync timing without suggesting an alternative. I have a simple one in mind that captures the biggest problem I see: that the number of backend and

Re: [HACKERS] Design proposal: fsync absorb linear slider

2013-07-23 Thread Greg Smith
On 7/23/13 10:56 AM, Robert Haas wrote: On Mon, Jul 22, 2013 at 11:48 PM, Greg Smith g...@2ndquadrant.com wrote: We know that a 1GB relation segment can take a really long time to write out. That could include up to 128K changed 8K pages, and we allow all of them to get dirty before any are