Re: [HACKERS] Move unused buffers to freelist
On Friday, May 24, 2013 8:22 PM Jim Nasby wrote: On 5/14/13 8:42 AM, Amit Kapila wrote: >> In the attached patch, bgwriter/checkpointer moves unused (usage_count = 0 && >> refcount = 0) buffers to the end of the freelist. I have implemented a new API >> StrategyMoveBufferToFreeListEnd() to >> >> move buffers to the end of the freelist. >> > Instead of a separate function, would it be better to add an argument to > StrategyFreeBuffer? Yes, it could be done with a parameter which decides whether to put the buffer at the head or the tail of the freelist. However, currently the main focus is to check in which cases this optimization can give a benefit. Robert has run tests for quite a number of cases where it doesn't show any significant gain. I am also trying various configurations to see if it gives any benefit. Robert has given some suggestions to change the way the new function is currently called; I will try them and post the results. I am not very sure that default pgbench is a good scenario to test this optimization. If you have any suggestions for tests where it can show a benefit, that would be great input. > ISTM this is similar to the other strategy stuff in the buffer manager, so > perhaps it should mirror that... With Regards, Amit Kapila. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
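Jim's suggestion — one StrategyFreeBuffer with a head/tail argument instead of a separate StrategyMoveBufferToFreeListEnd() — can be sketched with a toy freelist. This is a simplified standalone model, not PostgreSQL's actual bufmgr code (which chains buffers by freeNext index under a lock); the function names and the `at_head` flag are illustrative only:

```c
#include <stdlib.h>
#include <assert.h>

/* Toy freelist: buffers are taken from the head, so a buffer pushed at
 * the tail is reused last (what the patch wants for buffers the bgwriter
 * merely found unused), while a buffer pushed at the head is reused
 * first (the existing behavior for truly invalidated buffers). */
typedef struct FreeBuf
{
    int buf_id;
    struct FreeBuf *next;
} FreeBuf;

typedef struct
{
    FreeBuf *head;
    FreeBuf *tail;
} FreeList;

/* Hypothetical merged API: one function, with a flag choosing head or tail. */
static void strategy_free_buffer(FreeList *fl, int buf_id, int at_head)
{
    FreeBuf *b = malloc(sizeof(FreeBuf));
    b->buf_id = buf_id;
    b->next = NULL;
    if (fl->head == NULL)
        fl->head = fl->tail = b;
    else if (at_head)
    {
        b->next = fl->head;
        fl->head = b;
    }
    else
    {
        fl->tail->next = b;
        fl->tail = b;
    }
}

/* Pop from the head; -1 means the freelist is empty. */
static int strategy_get_buffer(FreeList *fl)
{
    FreeBuf *b = fl->head;
    int id;
    if (b == NULL)
        return -1;
    fl->head = b->next;
    if (fl->head == NULL)
        fl->tail = NULL;
    id = b->buf_id;
    free(b);
    return id;
}
```

With this shape, the bgwriter/checkpointer path becomes `strategy_free_buffer(fl, buf, 0)` while invalidation keeps `strategy_free_buffer(fl, buf, 1)`, mirroring the existing strategy code as Jim suggests.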
Re: [HACKERS] getting rid of freezing
On 05/24/2013 07:00 PM, Robert Haas wrote: > On Fri, May 24, 2013 at 11:29 AM, Robert Haas wrote: >> On Fri, May 24, 2013 at 10:53 AM, Andres Freund >> wrote: [all-visible cannot restore hint bits without FPI because of torn pages] >>> I haven't yet thought about this sufficiently. I think we might have >>> a chance of working around this, let me ponder a bit. >> Yeah. I too feel like there might be a solution. But I don't >> have something specific in mind, yet anyway. > One thought I had is that it might be beneficial to freeze when a page > ceases to be all-visible, rather than when it becomes all-visible. That is what I aimed to describe in my mail earlier, but your description is much clearer :) > Any operation that makes the page not-all-visible is going to emit an > FPI anyway, so we don't have to worry about torn pages in that case. > Under such a scheme, we'd have to enforce the rule that xmin and xmax > are ignored for any page that is all-visible; Agreed. We already rely on all-visible pages enough that we can trust them to be correct. Making that a universal rule should not add any risks. The rule "page all-visible ==> assume all tuples frozen" would also enable VACUUM FREEZE to work only on the non-all-visible pages. > and when a page ceases > to be all-visible, we have to go back and really freeze the > pre-existing tuples. We can do this unconditionally, or in a milder case use vacuum_freeze_min_age if we want to retain xids for forensic purposes. > I think we might be able to use the existing > all_visible_cleared/new_all_visible_cleared flags to trigger this > behavior, without adding anything new to WAL at all. This seems to be the easiest approach. -- Hannu Krosing PostgreSQL Consultant Performance, Scalability and High Availability 2ndQuadrant Nordic OÜ
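The invariant being proposed — tuples on an all-visible page are treated as frozen, and the stored xmins must be truly frozen before the bit is cleared — can be modeled in miniature. This is a toy sketch of the rule only; the names (ToyPage, clear_all_visible) and the FROZEN_XID constant are invented and bear no relation to the real heap page layout:

```c
#include <assert.h>

#define FROZEN_XID 2    /* stand-in for PostgreSQL's FrozenTransactionId */
#define MAX_TUPLES 8

typedef struct
{
    unsigned int xmin[MAX_TUPLES];
    int ntuples;
    int all_visible;
} ToyPage;

/* Rule 1: on an all-visible page, every tuple is treated as frozen
 * regardless of its stored xmin. */
static int tuple_effectively_frozen(const ToyPage *p, int i)
{
    return p->all_visible || p->xmin[i] == FROZEN_XID;
}

/* Rule 2: clearing all-visible (e.g. before an UPDATE/DELETE, which
 * emits an FPI anyway) must first really freeze the pre-existing
 * tuples, so the "ignored" xmins can never become live again. */
static void clear_all_visible(ToyPage *p)
{
    if (p->all_visible)
    {
        for (int i = 0; i < p->ntuples; i++)
            p->xmin[i] = FROZEN_XID;
        p->all_visible = 0;
    }
}
```

The point of the model is the ordering: the real freeze happens inside the clear operation, which is why no extra WAL record would be needed beyond the existing all_visible_cleared flags.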
Re: [HACKERS] Incomplete description of pg_start_backup?
On 24.05.2013 14:33, Dmitry Koterov wrote: I still don't get it. Suppose we have a data file with blocks with important (non-empty) data: A B C D 1. I call pg_start_backup(). 2. Tar starts to copy A block to the destination archive... 3. During this copying, somebody removes data from a table which is situated in B block. So this data is a subject for vacuuming, and the block is marked as free space. 4. Somebody writes data to a table, and this data is placed to a free space - to B block. This is also added to the WAL log (so the data is stored at 2 places: at B block and at WAL). 5. Tar (at last!) finishes copying of A block and begins to copy B block. 6. It finishes, then it copies C and D to the archive too. 7. Then we call pg_stop_backup() and also archive collected WAL (which contains the new data of B block as we saw above). The question is - *where is the OLD data of B block in this scheme?* Seems it is NOT in the backup! Correct. So it cannot be restored. Right, the old data on block B is gone. The backup is fine; the old data on block B is not needed to recover the backup. (And, in case when we never overwrite blocks between pg_start_backup...pg_stop_backup, but always append the new data, it is not a problem.) Seems to me this is not documented at all! That is what my initial e-mail was about. When you restore the backup, the database is restored to the state it was in when pg_stop_backup() was called. What did you expect? (I have one hypothesis on that, but I am not sure. Here it is: does vacuum save ALL deleted data of B block to WAL on step 3 prior to deletion? If yes, it is, of course, a part of the backup. But it wastes space a lot...) Nope, it doesn't do that. - Heikki
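Heikki's point — that the possibly inconsistent file copy plus the WAL collected during the backup window is sufficient, so the OLD contents of an overwritten block are never consulted — can be demonstrated with a toy simulation. Everything here (the four-block "data file", the WAL record shape) is invented for illustration and is far simpler than real WAL:

```c
#include <assert.h>

#define NBLOCKS 4

typedef struct
{
    int blkno;
    int value;
} WalRecord;

typedef struct
{
    int blocks[NBLOCKS];    /* the live data file */
    WalRecord wal[16];      /* WAL kept between start and stop backup */
    int nwal;
} ToyDb;

/* A write during the backup window goes to the data file *and* the WAL. */
static void db_write(ToyDb *db, int blkno, int value)
{
    db->blocks[blkno] = value;
    db->wal[db->nwal].blkno = blkno;
    db->wal[db->nwal].value = value;
    db->nwal++;
}

/* Recovery: start from the copied blocks (which may mix old and new
 * states) and replay WAL on top. The pre-image of an overwritten block
 * is never read, which is why it need not be in the backup at all. */
static void db_recover(int *restored, const WalRecord *wal, int nwal)
{
    for (int i = 0; i < nwal; i++)
        restored[wal[i].blkno] = wal[i].value;
}
```

In the scenario from the mail, tar happens to copy the NEW contents of block B, and WAL replay simply writes that same new value again; had tar caught a half-old, half-new (torn) B, replay would still overwrite it with the new value.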
Re: [HACKERS] visibilitymap_set and checksums
On 24 May 2013 20:26, Andres Freund wrote: > On 2013-05-24 19:09:57 +0100, Simon Riggs wrote: >> On 24 May 2013 18:40, Andres Freund wrote: >> >> > That pattern looks dangerous. Setting the lsn of the heap page will >> > prevent the next action from doing a FPI even if it would be required. >> >> Can you be more specific about the danger you see? > > CHECKPOINT at lsn 0/10; > vacuum starts > vacuum finds page which is all visible > vacuum sets all_visible > PageSetAllVisible(page); > MarkBufferDirty(buf); > visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr, > vmbuffer, visibility_cutoff_xid); > recptr = log_heap_visible(rel->rd_node, heapBuf, vmBuf, > cutoff_xid); > if (DataChecksumsEnabled()) > PageSetLSN(heapPage, recptr); > > So at this point the *heap* page will have the lsn of the > xl_heap_visible record. Which I thought to be rather dangerous because I > somehow missed the fact that log_heap_visible does: > if (DataChecksumsEnabled()) > { > rdata[1].next = &(rdata[2]); > > rdata[2].data = NULL; > rdata[2].len = 0; > rdata[2].buffer = heap_buffer; > rdata[2].buffer_std = true; > rdata[2].next = NULL; > } > > So. Forget what I said, I just was confused. I think it's perfectly understandable. Robert, Jeff and I discussed that for a while before we passed it. I'm still not happy with it, and think it's a pretty confusing section of code with multiple paths through it, but I just can't see a better way. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Parallel Sort
Let me introduce one thing we discussed in the developer meeting at Ottawa. We reached a consensus that a pluggable exec-node may be useful to replace a part of the exec-node tree with an alternative one implemented by extensions, which will allow running something like "GpuSort" instead of the existing Sort. http://wiki.postgresql.org/wiki/PgCon_2013_Developer_Meeting#Pluggable_plan.2Fexec_nodes 2013/5/24 james : >> Have you considered GPU-based sorting? I know there's been discussion in >> the past. > > If you use OpenCL, then you can use a CPU driver if there is no GPU, and > that can allow you to leverage all the CPU cores without having to do the > multi-thread stuff in the backend. > > While the compilation of a specific kernel can be quite expensive, it also > has the effect of a JIT compiler in terms of system independence. -- KaiGai Kohei
Re: [HACKERS] getting rid of freezing
On 2013-05-24 15:49:31 -0400, Josh Berkus wrote: > If I understand your solution correctly, though, this doesn't really > help the pathological case for freezing, which is the time-oriented > append-only table. For data which isn't being used, allvisible won't be > set either because it won't have been read, no? Is it still cheaper to > set allvisible than vacuum freeze even in that case? all-visible is only set in vacuum, and it determines which parts of a table will be scanned in a non-full-table vacuum. So, since we won't regularly start vacuum in the insert-only case, there will still be a batch of work at once. But nearly all of that work is *already* performed; we would just shift the details of it around a bit. But since we would now only need to vacuum the non-all-visible part, that would get noticeably cheaper as well. I think for that case we should run vacuum more regularly on insert-only tables, since we currently don't do it regularly enough, which a) increases the amount of work needed at once and b) prevents index-only scans from working there. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] adding import in pl/python function
On 24 May 2013 21:46, Claudio Freire wrote: > On Fri, May 24, 2013 at 4:22 PM, Szymon Guz wrote: > > Hm... maybe you're right. I think I don't understand fully how the > > procedures are executed, and I need to read more to get it. > > > Well, it's easy. > > Instead of PLyFloat_FromNumeric[0], you can make a > PLyDecimal_FromNumeric. There, you'd do it with the Python/C API[1]: > > PyObject *decimal = PyImport_ImportModule("decimal"); > PyObject *decimal_dict = PyModule_GetDict(decimal); > PyObject *decimal_ctor = PyDict_GetItemString(decimal_dict, "Decimal"); > > And invoke it with a string rep of your Numeric: > > PyObject *value = PyObject_CallFunction(decimal_ctor, "S", string_value); > > Add of course all kinds of error checking and reference count boilerplate, > and you'd have a very dumb version of it. > > To make it more "pro", you'd want to do all that stuff to get > decimal_ctor only at initialization time. Especially since you don't > want to fumble with the import lock right there in _FromNumeric. > > And to make it totally "pro", you can even freeze Decimal (using > pyfreeze) if you'd like. I would only do this in contexts where you > don't have a stdlib of course. Not sure whether windows falls into > that category. Linux doesn't. > > > [0] > http://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/pl/plpython/plpy_typeio.c#l518 > [1] http://docs.python.org/2/c-api/import.html > Thanks, I will take a look at this, looks pretty easy. However testing on Windows will be pretty funny :) thanks, Szymon
Re: [HACKERS] getting rid of freezing
Andres, If I understand your solution correctly, though, this doesn't really help the pathological case for freezing, which is the time-oriented append-only table. For data which isn't being used, allvisible won't be set either because it won't have been read, no? Is it still cheaper to set allvisible than vacuum freeze even in that case? Don't get me wrong, I'm in favor of this if it fixes the other (more common) cases. I just want to be clear on the limitations. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] adding import in pl/python function
On Fri, May 24, 2013 at 4:22 PM, Szymon Guz wrote: > Hm... maybe you're right. I think I don't understand fully how the > procedures are executed, and I need to read more to get it. Well, it's easy. Instead of PLyFloat_FromNumeric[0], you can make a PLyDecimal_FromNumeric. There, you'd do it with the Python/C API[1]: PyObject *decimal = PyImport_ImportModule("decimal"); PyObject *decimal_dict = PyModule_GetDict(decimal); PyObject *decimal_ctor = PyDict_GetItemString(decimal_dict, "Decimal"); And invoke it with a string rep of your Numeric: PyObject *value = PyObject_CallFunction(decimal_ctor, "S", string_value); Add of course all kinds of error checking and reference count boilerplate, and you'd have a very dumb version of it. To make it more "pro", you'd want to do all that stuff to get decimal_ctor only at initialization time. Especially since you don't want to fumble with the import lock right there in _FromNumeric. And to make it totally "pro", you can even freeze Decimal (using pyfreeze) if you'd like. I would only do this in contexts where you don't have a stdlib of course. Not sure whether windows falls into that category. Linux doesn't. [0] http://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/pl/plpython/plpy_typeio.c#l518 [1] http://docs.python.org/2/c-api/import.html
Re: [HACKERS] visibilitymap_set and checksums
On 2013-05-24 19:09:57 +0100, Simon Riggs wrote: > On 24 May 2013 18:40, Andres Freund wrote: > > > That pattern looks dangerous. Setting the lsn of the heap page will > > prevent the next action from doing a FPI even if it would be required. > > Can you be more specific about the danger you see? CHECKPOINT at lsn 0/10; vacuum starts vacuum finds page which is all visible vacuum sets all_visible PageSetAllVisible(page); MarkBufferDirty(buf); visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr, vmbuffer, visibility_cutoff_xid); recptr = log_heap_visible(rel->rd_node, heapBuf, vmBuf, cutoff_xid); if (DataChecksumsEnabled()) PageSetLSN(heapPage, recptr); So at this point the *heap* page will have the lsn of the xl_heap_visible record. Which I thought to be rather dangerous because I somehow missed the fact that log_heap_visible does: if (DataChecksumsEnabled()) { rdata[1].next = &(rdata[2]); rdata[2].data = NULL; rdata[2].len = 0; rdata[2].buffer = heap_buffer; rdata[2].buffer_std = true; rdata[2].next = NULL; } So. Forget what I said, I just was confused. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] adding import in pl/python function
On 24 May 2013 21:14, Claudio Freire wrote: > On Fri, May 24, 2013 at 4:10 PM, Szymon Guz wrote: > > > > I'm thinking about something else. We could convert it into Decimal > > (http://docs.python.org/2/library/decimal.html) class in Python. > > Unfortunately this class requires import like `from decimal import > Decimal` > > from a standard Python library. > > > > I'm wondering if it would be a good idea to do it like this. It shouldn't > > fail even with the trusted version of pl/python, as I'd rather see the > > trusted version to allow importing packages from standard library. > > > Why would passing a decimal require an import? > > The extension itself needs a reference to Decimal, to build them, but > the procedure's context doesn't need to have it. > Hm... maybe you're right. I think I don't understand fully how the procedures are executed, and I need to read more to get it. thanks, Szymon
[HACKERS] shmem startup and shutdown hooks
Hi, While going through Andres' BDR code I noticed that it has a shmem startup hook, which installs an on_shmem_exit() callback to write stuff at shutdown time. This works fine but seems somewhat of a hazard: it has the postmaster do the actual write, which has to access shared memory while doing it. Moreover, since the callback is installed to be called at shmem reset, a process crash causes this code to be called, which seems similarly broken. It seems that the right place to do this is checkpointer shutdown, i.e. when the checkpointer is told to close shop it should also invoke various modules' shutdown callbacks. There's no hook point there though, so we'd need to introduce something new for this specific purpose. We already have pg_stat_statements doing this, viz. pgss_shmem_startup(): /* * If we're in the postmaster (or a standalone backend...), set up a shmem * exit hook to dump the statistics to disk. */ if (!IsUnderPostmaster) on_shmem_exit(pgss_shmem_shutdown, (Datum) 0); Also, it'd be good to discuss what should happen when another process crashes. Should we attempt to write the data anyway, or should we reset it just like we do for pgstats? Thoughts? -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
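For reference, the registration mechanism under discussion looks roughly like the toy model below: callbacks registered via an on_shmem_exit()-style call run in reverse order of registration (LIFO), which is also how the real mechanism behaves. All names here are invented so that the sketch compiles standalone; it is not PostgreSQL code, and it deliberately sidesteps the open question of *which* process should run the callbacks:

```c
#include <assert.h>

#define MAX_CALLBACKS 8

typedef void (*exit_cb)(void *arg);

/* Registered callbacks, filled at shmem-startup time by each module. */
static struct
{
    exit_cb cb;
    void *arg;
} callbacks[MAX_CALLBACKS];
static int ncallbacks = 0;

/* Mirrors on_shmem_exit(): remember a callback for shutdown time. */
static void toy_on_shmem_exit(exit_cb cb, void *arg)
{
    callbacks[ncallbacks].cb = cb;
    callbacks[ncallbacks].arg = arg;
    ncallbacks++;
}

/* Mirrors shmem_exit(): run callbacks in reverse registration order. */
static void toy_shmem_exit(void)
{
    while (ncallbacks > 0)
    {
        ncallbacks--;
        callbacks[ncallbacks].cb(callbacks[ncallbacks].arg);
    }
}

/* Test helper: log which callback ran, in what order. */
static int call_log[MAX_CALLBACKS];
static int ncalls = 0;

static void record_cb(void *arg)
{
    call_log[ncalls++] = (int) (long) arg;
}
```

Álvaro's concern maps onto this model directly: today `toy_shmem_exit()` effectively runs in the postmaster (and also after a crash-induced shmem reset); moving the loop into the checkpointer's shutdown path would need a new hook point.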
Re: [HACKERS] adding import in pl/python function
On Fri, May 24, 2013 at 4:10 PM, Szymon Guz wrote: > > I'm thinking about something else. We could convert it into the Decimal > (http://docs.python.org/2/library/decimal.html) class in Python. > Unfortunately this class requires an import like `from decimal import Decimal` > from the standard Python library. > > I'm wondering if it would be a good idea to do it like this. It shouldn't > fail even with the trusted version of pl/python, as I'd rather see the > trusted version allow importing packages from the standard library. Why would passing a decimal require an import? The extension itself needs a reference to Decimal, to build them, but the procedure's context doesn't need to have it.
Re: [HACKERS] Cost limited statements RFC
On Fri, May 24, 2013 at 10:36 AM, Jim Nasby wrote: > Doesn't that hit the old issue of not knowing if a read came from FS cache > or disk? I realize that the current cost_delay mechanism suffers from that > too, but since the API is lower level that restriction is much more > apparent. Sure, but I think it's still useful despite that limitation. > Instead of KB/s, could we look at how much time one process is spending > waiting on IO vs the rest of the cluster? Is it reasonable for us to measure > IO wait time for every request, at least on the most popular OSes? I doubt that's going to be very meaningful. The backend that dirties the buffer is fairly likely to be different from the backend that writes it out. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
[HACKERS] adding import in pl/python function
Hi, I'm wondering if it would be OK to change the procedure code before execution. I'm thinking about adding an import magically at the beginning of a function. Currently numeric arguments passed to the procedure are converted into floats. This is not good, as it causes loss of information. The solution proposed in a code comment is "maybe use a string?". I'm thinking about something else. We could convert it into the Decimal ( http://docs.python.org/2/library/decimal.html) class in Python. Unfortunately this class requires an import like `from decimal import Decimal` from the standard Python library. I'm wondering if it would be a good idea to do it like this. It shouldn't fail even with the trusted version of pl/python, as I'd rather see the trusted version allow importing packages from the standard library. regards, Szymon
Re: [HACKERS] Parallel Sort
> Have you considered GPU-based sorting? I know there's been discussion in the past. If you use OpenCL, then you can use a CPU driver if there is no GPU, and that can allow you to leverage all the CPU cores without having to do the multi-thread stuff in the backend. While the compilation of a specific kernel can be quite expensive, it also has the effect of a JIT compiler in terms of system independence.
Re: [HACKERS] Cost limited statements RFC
On 5/24/13 9:21 AM, Robert Haas wrote: But I wonder if we wouldn't be better off coming up with a little more user-friendly API. Instead of exposing a cost delay, a cost limit, and various charges, perhaps we should just provide limits measured in KB/s, like dirty_rate_limit = and read_rate_limit = . I already made and lost the argument for doing vacuum in KB/s units, so I wasn't planning on putting that in the way of this one. I still think it's possible to switch to real-world units and simplify all of those parameters. Maybe I'll get the energy to fight this battle again for 9.4. I do have a lot more tuning data from production deployments to use as evidence now. I don't think the UI end changes the bulk of the implementation work though. The time-consuming part of this development is inserting all of the cost delay hooks and validating that they work. Exactly what parameters and logic fire when they are called can easily be refactored later. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
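The arithmetic behind a KB/s-style knob is simple enough to sketch: after writing a batch of 8 KB pages, sleep long enough that the average rate stays under the limit. This is a back-of-the-envelope sketch only; the `limit_kbps` parameter echoes the `dirty_rate_limit` idea from Robert's mail, and the function name is hypothetical:

```c
#include <assert.h>

/* Microseconds to sleep after writing npages 8 KB pages so that the
 * average write rate stays at or under limit_kbps. A limit of 0 (or
 * less) disables throttling, by analogy with existing cost settings. */
static long throttle_sleep_usec(long npages, long limit_kbps)
{
    long kb_written = npages * 8;   /* assuming 8 KB pages */
    if (limit_kbps <= 0)
        return 0;
    /* time budget, in microseconds, to write this much at limit_kbps */
    return kb_written * 1000000L / limit_kbps;
}
```

A caller would accumulate pages written since the last nap and sleep for `throttle_sleep_usec(n, limit)` minus the time actually elapsed, which is the usual token-bucket-style shape such limiters take.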
Re: [HACKERS] WAL segments (names) not in a sequence
On Fri, May 24, 2013 at 3:24 PM, Amit Langote wrote: > On Sat, May 25, 2013 at 3:08 AM, German Becker > wrote: > > Thanks Amit, I understand now. Is there a way to know/predict how many > > preallocated segments there will be at a certain moment? What does it > depend > > on? > > Upthread, Fujii Masao-san suggested what might have happened that > caused these pre-allocated segments to be created. To quote him: > > "WAL recycling is performed by checkpoint. Checkpoint always checks > whether there are WAL files no longer required for crash recovery, > IOW, WAL files which were generated before the prior checkpoint > happened, and then if they are found, checkpoint tries to recycle > them." > > Reading here would also help: > http://www.postgresql.org/docs/9.1/static/wal-configuration.html > > If you are still using the same values as during this observation, > could you provide values for these postgresql.conf parameters: > checkpoint_segments, checkpoint_timeout, wal_keep_segments? > > > -- > Amit Langote > Amit, First, thanks for your help and your interest. I (think) I understand how checkpoints/WAL segments work. What I didn't understand from the documentation is the possibility of segments being preallocated. I thought that the WAL segment with the highest sequence number is the one being written at present time, as opposed to a segment allocated to be written in the future. If you could clarify this point to me, that would be great.
Here are the parameters related to checkpoints.

For the "restore" config:

checkpoint_segments = 256 # in logfile segments, min 1, 16MB each
#checkpoint_timeout = 5min # range 30s-1h
#checkpoint_completion_target = 0.5 # checkpoint target duration, 0.0 - 1.0
#checkpoint_warning = 30s
#wal_keep_segments = 0 # in logfile segments, 16MB each; 0 disables

For the "production" config:

checkpoint_segments = 256 # in logfile segments, min 1, 16MB each
#checkpoint_timeout = 5min # range 30s-1h
checkpoint_completion_target = 0.7 # checkpoint target duration, 0.0 - 1.0
wal_keep_segments = 256 # in logfile segments, 16MB each; 0 disables

Thanks again, Germán
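As I recall, the 9.1-era documentation gives an upper bound on the number of pg_xlog segments as the larger of (2 + checkpoint_completion_target) × checkpoint_segments + 1 and checkpoint_segments + wal_keep_segments + 1; worth verifying against the exact version in use, but applied to the two configs above it gives a feel for how far ahead of the write position segments can exist:

```c
#include <assert.h>

/* Upper bound on retained/pre-allocated 16 MB WAL segments, using the
 * formula from the 9.1-era WAL configuration docs (assumption: check
 * against your exact server version). */
static long max_wal_segments(long checkpoint_segments,
                             double completion_target,
                             long wal_keep_segments)
{
    long by_checkpoint =
        (long) ((2.0 + completion_target) * checkpoint_segments) + 1;
    long by_keep = checkpoint_segments + wal_keep_segments + 1;

    return by_checkpoint > by_keep ? by_checkpoint : by_keep;
}
```

At 16 MB per segment, the "production" figure of 692 segments corresponds to roughly 10.8 GB of pg_xlog in the worst case, which also explains why segments well past the one currently being written can legitimately exist.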
Re: [HACKERS] Cost limited statements RFC
On 5/24/13 10:36 AM, Jim Nasby wrote: Instead of KB/s, could we look at how much time one process is spending waiting on IO vs the rest of the cluster? Is it reasonable for us to measure IO wait time for every request, at least on the most popular OSes? It's not just an OS-specific issue. The overhead of collecting timing data varies massively based on your hardware, which is why there's the pg_test_timing tool now to help quantify that. I have a design I'm working on that exposes the system load to the database usefully. That's what I think people really want if the goal is to be adaptive based on what else is going on. My idea is to use what "uptime" collects as a starting useful set of numbers to quantify what's going on. If you have both a short-term load measurement and a longer-term one like uptime provides, you can quantify both the overall load and whether it's rising or falling. I want to swipe some ideas on how moving averages are used to determine trend in stock trading systems: http://www.onlinetradingconcepts.com/TechnicalAnalysis/MASimple2.html Dynamic load-sensitive statement limits and autovacuum are completely feasible on UNIX-like systems. The work to insert a cost delay point needs to get done before building more complicated logic on top of it though, so I'm starting with this part. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
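The short-versus-long moving-average comparison borrowed from trading systems is easy to state in code: compute a simple moving average over a short recent window and over a long one, and call the load "rising" when the short average is above the long one. A minimal sketch; the window sizes and function names here are arbitrary:

```c
#include <assert.h>

/* Simple moving average of the last n samples of a len-sample series. */
static double sma_tail(const double *v, int len, int n)
{
    double sum = 0.0;

    for (int i = len - n; i < len; i++)
        sum += v[i];
    return sum / n;
}

/* Crossover rule: load is rising when the short-window average exceeds
 * the long-window one, falling (or flat) otherwise. */
static int load_rising(const double *v, int len, int short_n, int long_n)
{
    return sma_tail(v, len, short_n) > sma_tail(v, len, long_n);
}
```

With uptime-style 1/5/15-minute load averages the same comparison works without storing samples at all: 1-minute average above the 15-minute average means load is trending up.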
Re: [HACKERS] Incomplete description of pg_start_backup?
I still don't get it. Suppose we have a data file with blocks with important (non-empty) data: A B C D 1. I call pg_start_backup(). 2. Tar starts to copy A block to the destination archive... 3. During this copying, somebody removes data from a table which is situated in B block. So this data is a subject for vacuuming, and the block is marked as free space. 4. Somebody writes data to a table, and this data is placed to a free space - to B block. This is also added to the WAL log (so the data is stored at 2 places: at B block and at WAL). 5. Tar (at last!) finishes copying of A block and begins to copy B block. 6. It finishes, then it copies C and D to the archive too. 7. Then we call pg_stop_backup() and also archive collected WAL (which contains the new data of B block as we saw above). The question is - *where is the OLD data of B block in this scheme?* Seems it is NOT in the backup! So it cannot be restored. (And, in case when we never overwrite blocks between pg_start_backup...pg_stop_backup, but always append the new data, it is not a problem.) Seems to me this is not documented at all! That is what my initial e-mail was about. (I have one hypothesis on that, but I am not sure. Here it is: does vacuum save ALL deleted data of B block to WAL on step 3 prior to deletion? If yes, it is, of course, a part of the backup. But it wastes space a lot...) On Tue, May 14, 2013 at 6:05 PM, Jeff Janes wrote: > On Mon, May 13, 2013 at 4:31 PM, Dmitry Koterov wrote: > >> Could you please provide a bit more detailed explanation on how it works? >> >> And how could postgres write at the middle of archiving files during an >> active pg_start_backup? if it could, here might be a case when a part of >> archived data file contains an overridden information "from the future", >> > > The data files cannot contain information from the future.
If the backup > is restored, it must be restored to the time of pg_stop_backup (at least), > which means the data would at that point be from the past/present, not the > future. > > Cheers, > > Jeff >
Re: [HACKERS] WAL segments (names) not in a sequence
On Sat, May 25, 2013 at 3:08 AM, German Becker wrote: > Thanks Amit, I understand now. Is there a way to know/predict how many > preallocated segments there will be at a certain moment? What does it depend > on? Upthread, Fujii Masao-san suggested what might have happened that caused these pre-allocated segments to be created. To quote him: "WAL recycling is performed by checkpoint. Checkpoint always checks whether there are WAL files no longer required for crash recovery, IOW, WAL files which were generated before the prior checkpoint happened, and then if they are found, checkpoint tries to recycle them." Reading here would also help: http://www.postgresql.org/docs/9.1/static/wal-configuration.html If you are still using the same values as during this observation, could you provide values for these postgresql.conf parameters: checkpoint_segments, checkpoint_timeout, wal_keep_segments? -- Amit Langote
Re: [HACKERS] Parallel Sort
On 5/13/13 9:28 AM, Noah Misch wrote: It would be great if one client session could take advantage of multiple CPU cores. EnterpriseDB wishes to start the trek into this problem space for 9.4 by implementing parallel internal (i.e. not spilling to disk) sort. This touches on a notable subset of the infrastructure components we'll need for parallel general query. My intent is to map out the key design topics, hear about critical topics I hadn't considered, and solicit feedback on the quality of the high-level plan. Full designs for key pieces will come later. Have you considered GPU-based sorting? I know there's been discussion in the past. To me, the biggest advantage of GPU sorting is that most of the concerns you've laid out go away; a backend that needs to sort just throws data at the GPU to do the actual sorting; all the MVCC issues and what not remain within the scope of a single backend. -- Jim C. Nasby, Data Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net
Re: [HACKERS] visibilitymap_set and checksums
On 24 May 2013 18:40, Andres Freund wrote: > That pattern looks dangerous. Setting the lsn of the heap page will > prevent the next action from doing a FPI even if it would be required. Can you be more specific about the danger you see? -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] WAL segments (names) not in a sequence
Thanks Amit, I understand now. Is there a way to know/predict how many preallocated segments there will be at a certain moment? What does it depend on? On Fri, May 24, 2013 at 12:46 PM, Amit Langote wrote: > > I didn't quite understand what you mean by that... But anyways, do you > people think this sequence number overlap is "normal"? > There is "no overlap" at all. The newer segments that you see are > "pre-allocated" ones. They have not been written to yet. > > From the "ls -l pg_xlog" output that you sent, it can be seen that > segments starting from 0001000E00A8 through > 00010010007E have been pre-allocated (at that point of > time) and 0001000E00A7 is currently being written to. Just > look at the modified times in your "ls -l" listing. > 0001000E00A7 has May 22 15:32 (the latest writes seem to > have happened to this segment) whereas pre-allocated ones seem to have > around May 22 12:05 to 12:15 (which are yet to be written to). > > Does that help? > > -- > Amit Langote >
[HACKERS] visibilitymap_set and checksums
Hi, while thinking about vacuum freeze I noticed that since the checksums patch visibilitymap_set() does: /* * If data checksums are enabled, we need to protect the heap * page from being torn. */ if (DataChecksumsEnabled()) { Page heapPage = BufferGetPage(heapBuf); /* caller is expected to set PD_ALL_VISIBLE first */ Assert(PageIsAllVisible(heapPage)); PageSetLSN(heapPage, recptr); } That pattern looks dangerous. Setting the lsn of the heap page will prevent the next action from doing a FPI even if it would be required. It's e.g. called like this from lazy_scan_heap: if (all_visible && !all_visible_according_to_vm) { /* * It should never be the case that the visibility map page is set * while the page-level bit is clear, but the reverse is allowed * (if checksums are not enabled). Regardless, set the both bits * so that we get back in sync. * * NB: If the heap page is all-visible but the VM bit is not set, * we don't need to dirty the heap page. However, if checksums are * enabled, we do need to make sure that the heap page is dirtied * before passing it to visibilitymap_set(), because it may be * logged. Given that this situation should only happen in rare * cases after a crash, it is not worth optimizing. */ PageSetAllVisible(page); MarkBufferDirty(buf); visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr, vmbuffer, visibility_cutoff_xid); } other callers look similarly dangerous. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] PGCON meetup FreeNAS/FreeBSD: In Ottawa Tue & Wed.
I'm moving this to -advocacy, as it seems more appropriate there...

On 5/20/13 10:31 AM, Alfred Perlstein wrote:
> Hello PostgreSQL Hackers,
>
> I am now in Ottawa; last week we wrapped up the BSDCon and I was hoping to
> chat with a few Postgresql developers in person about using Postgresql in
> FreeNAS and offering it as an extension to the platform as a plug-in
> technology. Unfortunately due to time constraints I cannot attend the
> entire conference and I am only in town until Wednesday at noon. I'm hoping
> there's a good time to talk to a few developers about Postgresql + FreeNAS
> before I have to depart back to the bay area.

Were you able to meet up with anyone? If not, you should attend the SF Postgres Users Group meeting.

> Some info on me: My name is Alfred Perlstein, I am a FreeBSD developer and
> FreeNAS project lead. I am the VP of Software Engineering at iXsystems. I
> have been a fan of Postgresql for many years. In the early 2000s we built a
> high-speed web-tracking application on top of Postgresql and worked closely
> with the community to shake out performance issues and bugs, so closely
> that Tom Lane and Vadim Mikheev had logins on our box. Since that time I
> have tried to get Postgresql into as many places as possible.
>
> Some info on the topics I wanted to briefly discuss:
>
> 1) Using Postgresql as the config store for FreeNAS. We currently use
> SQLITE; SQLITE fits our needs until we get to the point of replication
> between HA (high availability) units. Then we are forced to manually sync
> data between configurations. A discussion on how we might do this better
> using Postgresql, while still maintaining our ease of config export
> (single file) and small footprint, would be interesting.

Probably best to discuss in person at SFPUG... :)

> 2) Postgresql plugin for FreeNAS. Flip a switch and suddenly your file
> server is also serving enterprise data.
> We currently have a plug-in architecture, but would like to discuss the
> possibility of a tighter integration so that Postgresql looks like a more
> cohesive addition to FreeNAS.

Doesn't sound like that should be terribly hard; you'd probably want to tweak the stock postgresql.conf for improved performance.

> 3) Statistic monitoring / EagleEye
>
> In FreeBSD/FreeNAS I have developed a system called EagleEye. EagleEye is
> a system where all MIBs are easily exportable with timestamps in a common
> format (for now YAML & modified CSV), which is then consumed by a utility
> that provides graphs. The entire point of EagleEye is to eventually
> upstream the modifications to future-proof statistics tracking in the
> FreeBSD and FreeNAS systems. I have spoken with some Illumos/ZFS
> developers and they are interested as well.

Are you thinking of storing that data in Postgres? You might find http://labs.omniti.com/labs/reconnoiter interesting.

-- Jim C. Nasby, Data Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] getting rid of freezing
On 5/24/13 9:53 AM, Andres Freund wrote:
>>> We don't even necessarily need to log the hint bits for all items since
>>> the redo for all_visible could make sure all items are hinted. The only
>>> problem is knowing up to where we can truncate pg_clog...
>>
>> [all-visible cannot restore hint bits without FPI because of torn pages]
>
> I haven't yet thought about this sufficiently yet. I think we might have
> a chance of working around this, let me ponder a bit.
>
> But even if that means needing a full page write via the usual mechanism
> for all visible if any hint bits needed to be set we are still out far
> ahead of the current state imo.
> * cleanup would quite possibly do an FPI shortly after in vacuum anyway.
>   If we do it for all visible, it possibly does not need to be done for it.
> * freezing would FPI almost guaranteedly since we do it so much later.
> * Not having to rescan the whole heap will be a bigger cost saving...

Would we only set all the hint bits within vacuum? If so I don't think the WAL hit matters at all, because vacuum is almost always a background, throttled process.

-- Jim C. Nasby, Data Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] background processes vs. hot standby
On Fri, May 24, 2013 at 11:25 AM, Alvaro Herrera wrote:
> Andres Freund escribió:
>> I wonder if we shouldn't make background workers use connection slots
>> from max_connections similar to how superuser_reserved_connections
>> work. That would mean we don't need to care about it for HS.
>
> I remember considering this and concluding that it's messy. Suppose we
> decide that the registered bgworker number would be subtracted from
> max_connections: if the configuration registers as many bgworkers as
> max_connections, then no client connections can take place; if there are
> more bgworkers than max_connections, there's going to be errors at
> startup because the last few bgworkers cannot start at all (and no
> client connections will be allowed). So users would be forced to change
> max_connections according to bgworkers configuration. That doesn't
> sound friendly.

I agree. To put that more succinctly, if we take that approach, then max_connections is no longer the maximum number of connections, which is a POLA violation.

> (I currently have no proposal on how to go about solving this problem.)

If the problem were only with the size of the lock table, I'd be somewhat inclined to propose ripping out max_locks_per_transaction and putting in a GUC called max_locks instead. The current system seems more confusing than helpful; when the default proves insufficient, the recourse is usually to figure out how many objects we need to lock, and then divide by max_connections to figure out how to set max_locks_per_transaction, which is really backing into the problem from the wrong end. But I don't know what to do about the problem of needing to know how many backends there are. I agree with Andres that it's not very friendly to enforce a restriction that all the same worker processes must be present on the standby.
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] getting rid of freezing
On Fri, May 24, 2013 at 11:52 AM, Andres Freund wrote:
>> The basic problem is that if the data is going to be removed before it
>> would have gotten frozen, then the extra FPIs are just overhead. In
>> effect, we're just deciding to freeze a lot sooner.
>
> Well, freezing without removing information for debugging.

Sure, but what I'm trying to avoid is incurring the WAL cost of freezing. If we didn't mind paying that sooner, we could just drop vacuum_freeze_min/table_age. But we do mind that.

>> And while that
>> might well be beneficial in some use cases (e.g. the data's already in
>> cache) it might also not be so beneficial (the table is larger than
>> cache and would have been dropped before freezing kicked in).
>
> Not sure how caching comes into play here? At this point we know the
> page to be in cache already since vacuum is looking at it anyway?

OK, true.

> I think it's not really comparable since in those situations we a)
> already do an XLogInsert(). b) already dirty the page. so the only
> change is that we possibly write an additional full page image. If
> there is actually near future DML write activity that would make the
> all-visible superfluous that would have to FPI likely anyway.

Well, if there's near-future write activity, then freezing is pretty worthless anyway. What I'm trying to avoid is adding WAL overhead in the case where there *isn't* any near-future write activity, like inserting 100MB of data into an existing table.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] getting rid of freezing
On Fri, May 24, 2013 at 11:29 AM, Robert Haas wrote:
> On Fri, May 24, 2013 at 10:53 AM, Andres Freund wrote:
>>> [all-visible cannot restore hint bits without FPI because of torn pages]
>>
>> I haven't yet thought about this sufficiently yet. I think we might have
>> a chance of working around this, let me ponder a bit.
>
> Yeah. I too feel like there might be a solution. But I don't have
> something specific in mind, yet anyway.

One thought I had is that it might be beneficial to freeze when a page ceases to be all-visible, rather than when it becomes all-visible. Any operation that makes the page not-all-visible is going to emit an FPI anyway, so we don't have to worry about torn pages in that case. Under such a scheme, we'd have to enforce the rule that xmin and xmax are ignored for any page that is all-visible; and when a page ceases to be all-visible, we have to go back and really freeze the pre-existing tuples. I think we might be able to use the existing all_visible_cleared/new_all_visible_cleared flags to trigger this behavior, without adding anything new to WAL at all.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
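The proposed rule could be sketched like this (hypothetical struct and function names, and a grossly simplified fallback check, purely to illustrate "all-visible implies treat as frozen", not the actual tqual.c logic):

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Toy stand-ins for page/tuple state; names are illustrative. */
typedef struct PageSketch
{
    bool        all_visible;    /* PD_ALL_VISIBLE */
} PageSketch;

typedef struct TupleSketch
{
    TransactionId xmin;
    TransactionId xmax;
    bool        xmin_committed;
} TupleSketch;

/*
 * Under the proposed rule, a visibility check may short-circuit on the
 * page bit: if the page is all-visible, xmin/xmax are ignored and every
 * tuple on it is treated as frozen and visible.  The else-branch is a
 * crude placeholder for the real visibility test.
 */
static bool
tuple_visible(const PageSketch *page, const TupleSketch *tup)
{
    if (page->all_visible)
        return true;
    return tup->xmin_committed && tup->xmax == 0;
}
```

The interesting invariant is then maintained on the transition out of all-visible: whoever clears the bit must really freeze the pre-existing tuples, under the FPI that the clearing operation emits anyway.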
Re: [HACKERS] getting rid of freezing
On 2013-05-24 11:29:10 -0400, Robert Haas wrote:
>> But even if that means needing a full page write via the usual mechanism
>> for all visible if any hint bits needed to be set we are still out far
>> ahead of the current state imo.
>> * cleanup would quite possibly do an FPI shortly after in vacuum anyway.
>>   If we do it for all visible, it possibly does not need to be done for it.
>> * freezing would FPI almost guaranteedly since we do it so much later.
>> * Not having to rescan the whole heap will be a bigger cost saving...
>
> The basic problem is that if the data is going to be removed before it
> would have gotten frozen, then the extra FPIs are just overhead. In
> effect, we're just deciding to freeze a lot sooner.

Well, freezing without removing information for debugging.

> And while that
> might well be beneficial in some use cases (e.g. the data's already in
> cache) it might also not be so beneficial (the table is larger than
> cache and would have been dropped before freezing kicked in).

Not sure how caching comes into play here? At this point we know the page to be in cache already since vacuum is looking at it anyway?

I think it's not really comparable since in those situations we a) already do an XLogInsert(). b) already dirty the page. so the only change is that we possibly write an additional full page image. If there is actually near future DML write activity that would make the all-visible superfluous that would have to FPI likely anyway.

Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL segments (names) not in a sequence
> I didn't quite understand what you mean by that... But anyways so do you
> people think this sequence number overlap is "normal" ?

There is "no overlap" at all. The newer segments that you see are "pre-allocated" ones. They have not been written to yet.

From the "ls -l pg_xlog" output that you sent, it can be seen that segments starting from 000000010000000E000000A8 through 00000001000000100000007E have been pre-allocated (at that point of time) and 000000010000000E000000A7 is currently being written to. Just look at the modified times in your "ls -l" listing. 000000010000000E000000A7 has May 22 15:32 (the latest writes seem to have happened to this segment) whereas pre-allocated ones seem to have around May 22 12:05 to 12:15 (which are yet to be written to).

Does that help?

-- Amit Langote -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
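As an aside, the apparent "jump" in names is just hexadecimal counting: a segment file name is three 8-hex-digit fields, the timeline ID, the logical xlog file number, and the segment number within it, as in the XLogFileName() macro. A sketch of that formatting:

```c
#include <stdio.h>
#include <stdint.h>

/*
 * Mirrors the XLogFileName() macro: timeline ID, logical xlog file
 * number, and segment number, each formatted as 8 uppercase hex digits
 * (24 characters total).  So a listing that runs from segment ...A8 of
 * one log file into ...7E of the next is ordinary hex counting plus
 * preallocation, not an overlap.
 */
static void
xlog_file_name(char *buf, size_t buflen,
               uint32_t tli, uint32_t logid, uint32_t seg)
{
    snprintf(buf, buflen, "%08X%08X%08X", tli, logid, seg);
}
```

For example, timeline 1, log file 0xE, segment 0xA7 yields 000000010000000E000000A7, the segment Amit points at above.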
Re: [HACKERS] getting rid of freezing
On Fri, May 24, 2013 at 10:53 AM, Andres Freund wrote:
>> [all-visible cannot restore hint bits without FPI because of torn pages]
>
> I haven't yet thought about this sufficiently yet. I think we might have
> a chance of working around this, let me ponder a bit.

Yeah. I too feel like there might be a solution. But I don't have something specific in mind, yet anyway.

> But even if that means needing a full page write via the usual mechanism
> for all visible if any hint bits needed to be set we are still out far
> ahead of the current state imo.
> * cleanup would quite possibly do an FPI shortly after in vacuum anyway.
>   If we do it for all visible, it possibly does not need to be done for it.
> * freezing would FPI almost guaranteedly since we do it so much later.
> * Not having to rescan the whole heap will be a bigger cost saving...

The basic problem is that if the data is going to be removed before it would have gotten frozen, then the extra FPIs are just overhead. In effect, we're just deciding to freeze a lot sooner. And while that might well be beneficial in some use cases (e.g. the data's already in cache) it might also not be so beneficial (the table is larger than cache and would have been dropped before freezing kicked in).

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL segments (names) not in a sequence
On Fri, May 24, 2013 at 10:01 AM, Amit Langote wrote:
>> Maybe I didn't explain correctly. I am using COPY/pg_dump/pg_restore for
>> migration (and it is working fine). The streaming replication is for
>> hot-standby replication *once migrated*. Thing is, I disable archiving and
>> set wal_level to minimal when migrating the large portion of data, to make
>> it faster. Then I switch to wal_level=hot_standby, i.e. the "production"
>> configuration, and the WAL segment sequence seems to overlap with the
>> segments generated with the other setting.
>
> Though, now you understand it's not what it looks like, right? :-)
>
> --
> Amit Langote

I didn't quite understand what you mean by that... But anyways so do you people think this sequence number overlap is "normal" ?
Re: [HACKERS] background processes vs. hot standby
Andres Freund escribió:
> I wonder if we shouldn't make background workers use connection slots
> from max_connections similar to how superuser_reserved_connections
> work. That would mean we don't need to care about it for HS.

I remember considering this and concluding that it's messy. Suppose we decide that the registered bgworker number would be subtracted from max_connections: if the configuration registers as many bgworkers as max_connections, then no client connections can take place; if there are more bgworkers than max_connections, there's going to be errors at startup because the last few bgworkers cannot start at all (and no client connections will be allowed). So users would be forced to change max_connections according to bgworkers configuration. That doesn't sound friendly.

If, instead of subtracting bgworkers from max_connections, we were to add the number of bgworkers to max_connections, then we're no better than currently, because the number of bgworkers on the standby would be different from those on the master, and we'd be back into the problem of how to ensure that the allowed number of locks meets the restriction. If you wanted to have more bgworkers in the master than the standby, you'd have to advise users to increase max_connections in the standby to fulfill the restriction.

(I currently have no proposal on how to go about solving this problem.)

-- Álvaro Herrera    http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_rewind, a tool for resynchronizing an old master after failover
On 5/23/13 12:51 PM, Pavan Deolasee wrote: On Thu, May 23, 2013 at 11:10 PM, Heikki Linnakangas wrote: On 23.05.2013 07:55, Robert Haas wrote: On Thu, May 23, 2013 at 7:10 AM, Heikki Linnakangas wrote: 1. Scan the WAL log of the old cluster, starting from the point where the new cluster's timeline history forked off from the old cluster. For each WAL record, make a note of the data blocks that are touched. This yields a list of all the data blocks that were changed in the old cluster, after the new cluster forked off. Suppose that a transaction is open and has written tuples at the point where WAL forks. After WAL forks, the transaction commits. Then, it hints some of the tuples that it wrote. There is no record in WAL that those blocks are changed, but failing to revert them leads to data corruption. Bummer, you're right. Hmm, if you have checksums enabled, however, we'll WAL log a full-page every time a page is dirtied for setting a hint bit, which fixes the problem. So, there's a caveat with pg_rewind; you must have checksums enabled. I was quite impressed with the idea, but hint bits indeed are problem. I realised the same issue also applies to the other idea that Fujii-san and others have suggested about waiting for dirty buffers to be written until the WAL is received at the standby. But since that idea would anyways need to be implemented in the core, we could teach SetHintBits() to return false unless the corresponding commit WAL records are written to the standby first. Would it be useful to turn this problem around? Heikki's proposal is based on being able to track (without fail) all blocks that have been modified; could we instead track blocks that we know for certain have NOT been modified? The difference there is that we can be more conservative in stating "we know this block is the same"; worst case we just do some extra copying. One possibility would be to use file timestamps. 
For files that are past a certain age on both master and slave, if we force the timestamp on the slave to match the timestamp from the master, rsync will be able to safely ignore that file. I realize that's not as good as block-level detection, but it's probably a tremendous improvement over what we have today. The critical thing in this case would be to *guarantee* that the timestamps did not match on modified files. Of course, screwing around with FS timestamps in this manner is pretty grotty, at least on a live system. Perhaps there's some way to track that info separately and then use it to change file timestamps before running rsync. Or if we are able to define a list of files that we think may have changed, we just feed that list to rsync and let it do the heavy lifting. -- Jim C. Nasby, Data Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Patch to add support of "IF NOT EXISTS" to others "CREATE" statements
Hi all,

I'm working on a patch to add support for "IF NOT EXISTS" to the "CREATE" statements that don't have it yet. I started with the "DefineStmt" section from "src/backend/parser/gram.y":

- CREATE AGGREGATE [ IF NOT EXISTS ] ...
- CREATE OPERATOR [ IF NOT EXISTS ] ...
- CREATE TYPE [ IF NOT EXISTS ] ... [AS [{ENUM | RANGE}] (...)]
- CREATE TEXT SEARCH {PARSER | DICTIONARY | TEMPLATE | CONFIGURATION} [ IF NOT EXISTS ] ...
- CREATE COLLATION [ IF NOT EXISTS ] ...

My intention is to cover other CREATE statements too, not just the ones above. If there are no objections to this implementation, I'll finish it and send the patch soon.

Regards, -- Fabrízio de Royes Mello Consultoria/Coaching PostgreSQL >> Blog sobre TI: http://fabriziomello.blogspot.com >> Perfil Linkedin: http://br.linkedin.com/in/fabriziomello >> Twitter: http://twitter.com/fabriziomello
Re: [HACKERS] pg_rewind, a tool for resynchronizing an old master after failover
On 2013-05-24 10:57:22 -0400, Thom Brown wrote:
> By the way, without any data inserted I get:
>
> thom@swift /tmp $ pg_rewind --target-pgdata=/tmp/primary
> --source-server='host=localhost port=5531 dbname=postgres' -v
> connected to remote server
> fetched file "global/pg_control", length 8192
> fetched file "pg_xlog/0002.history", length 41
> Last common WAL position: 0/300 on timeline 1
> pg_rewind: xlogreader.c:214: XLogReadRecord: Assertion `((RecPtr) %
> 8192 >= (((intptr_t) ((sizeof(XLogPageHeaderData))) + ((8) - 1)) &
> ~((intptr_t) ((8) - 1' failed.
> Aborted (core dumped)
>
> And this magically goes away with -k on initdb.

That looks like a bug independent of -k. pg_rewind tries to read the beginning of a page, but xlogreader doesn't allow that atm. The likely reason this doesn't happen with -k is that autovacuum probably hints some pages, which are logged when checksums are enabled.

Should we make that assert

    Assert((RecPtr % XLOG_BLCKSZ == 0) || XRecOffIsValid(RecPtr));

instead?

Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_rewind, a tool for resynchronizing an old master after failover
On 23 May 2013 10:03, Thom Brown wrote:
> On 23 May 2013 07:10, Heikki Linnakangas wrote:
>> Hi,
>>
>> I've been hacking on a tool to allow resynchronizing an old master server
>> after failover. The need to do a full backup/restore has been a common
>> complaint ever since we've had streaming replication. I saw on the wiki
>> that this was discussed in the dev meeting; too bad I couldn't make it.
>>
>> In a nutshell, the idea is to copy everything that has changed between
>> the clusters, like rsync does, but instead of reading through all files,
>> use the WAL to determine what has changed. Here's a somewhat more detailed
>> explanation, from the README:
>>
>> Theory of operation
>> -------------------
>>
>> The basic idea is to copy everything from the new cluster to old, except
>> for the blocks that we know to be the same.
>>
>> 1. Scan the WAL log of the old cluster, starting from the point where
>> the new cluster's timeline history forked off from the old cluster. For
>> each WAL record, make a note of the data blocks that are touched. This
>> yields a list of all the data blocks that were changed in the old
>> cluster, after the new cluster forked off.
>>
>> 2. Copy all those changed blocks from the new master to the old master.
>>
>> 3. Copy all other files like clog, conf files etc. from the new cluster
>> to old. Everything except the relation files.
>>
>> 4. Apply the WAL from the new master, starting from the checkpoint
>> created at failover. (pg_rewind doesn't actually apply the WAL, it just
>> creates a backup label file indicating that when PostgreSQL is started,
>> it will start replay from that checkpoint and apply all the required WAL)
>>
>> Please take a look: https://github.com/vmware/pg_rewind
>
> 6 instances set up:
>
> [Primary (5530)]
>     |
>     ---[Standby 1 (5531)]
>     |
>     ---[Standby 2 (5532)]
>     |
>     ---[Standby 3 (5533)]
>     |
>     ---[Standby 4 (5534)]
>     |
>     ---[Standby 5 (5535)]
>
> 1) Created a table on the primary with some data.
> 2) Promoted Standby 1
> 3) Cleanly shut down Primary
> 4) pg_rewind --target-pgdata=/tmp/primary
>    --source-server='host=localhost port=5531 dbname=postgres'
>
> Last common WAL position: 0/30227F8 on timeline 1
> Last common checkpoint at 0/30227F8 on timeline 1
> error reading xlog record: record with zero length at 0/3022860
> Done!
>
> Contents of pg_xlog directory in Primary and Standby 1:
>
> thom@swift /tmp $ ls -l primary/pg_xlog/
> total 49156
> -rw------- 1 thom users 16777216 May 23 09:52 00010002
> -rw------- 1 thom users 16777216 May 23 09:52 00010003
> -rw------- 1 thom users 16777216 May 23 09:52 00020003
> -rw------- 1 thom users       41 May 23 09:52 0002.history
> drwx------ 2 thom users       80 May 23 09:52 archive_status
>
> thom@swift /tmp $ ls -l standby1/pg_xlog/
> total 49156
> -rw------- 1 thom users 16777216 May 23 09:49 00010002
> -rw------- 1 thom users 16777216 May 23 09:50 00010003
> -rw------- 1 thom users 16777216 May 23 09:52 00020003
> -rw------- 1 thom users       41 May 23 09:50 0002.history
> drwx------ 2 thom users       80 May 23 09:50 archive_status
>
> 5) Changed recovery.done in primary to point its primary_conninfo port
>    to 5531 (that of Standby 1).
> 6) Renamed it to .conf.
> 7) Changed postgresql.conf to set the port back to its original one
>    (as pg_rewind has caused it to match that of Standby 1)
> 8) Start Primary
>
> Latest log in primary reads:
>
> LOG: database system was interrupted; last known up at 2013-05-23 09:50:34 EDT
> LOG: entering standby mode
> LOG: invalid xl_info in checkpoint record
> FATAL: could not locate required checkpoint record
> HINT: If you are not restoring from a backup, try removing the file
> "/tmp/primary/backup_label".
> LOG: startup process (PID 31503) exited with exit code 1
> LOG: aborting startup due to startup process failure
>
> 9) Okay, so I'll delete that label and try again.
> Now all is well:
>
> LOG: database system was interrupted; last known up at 2013-05-23 09:50:34 EDT
> LOG: entering standby mode
> LOG: database system was not properly shut down; automatic recovery in progress
> LOG: redo starts at 0/3022828
> LOG: record with zero length at 0/3041A60
> LOG: consistent recovery state reached at 0/3041A60
> LOG: database system is ready to accept read only connections
> LOG: started streaming WAL from primary at 0/300 on timeline 2
>
> 10) Connect to Standby 1 and insert more rows into our original table.
> 11) Connect to Primary and those rows are appearing.

By the way, without any data inserted I get:

thom@swift /tmp $ pg_rewind --target-pgdata=/tmp/primary --source-server='host=localhost port=5531 dbname=postgres' -v
connected to remote server
fetched file "global/pg_control", length 8192
fetched file "pg_xlog/0002.hist
Re: [HACKERS] getting rid of freezing
On 2013-05-23 22:09:02 -0400, Robert Haas wrote:
> On Thu, May 23, 2013 at 1:51 PM, Andres Freund wrote:
>> So, what I propose instead is basically:
>> 1) only vacuum non-all-visible pages, even when doing it for
>>    anti-wraparound
>
> Check. We might want an option to force a scan of the whole relation.

Yea, thought of that as well. VACUUM (DEEP) ;).

>> 3) When we cannot mark a page all-visible or we cannot get the cleanup
>>    lock, remember the oldest xmin on that page. We could set all visible
>>    in the former case, but we want the page to be cleaned up sometime
>>    soonish.
>
> I think you mean "in the latter case" not "in the former case". If
> not, then I'm confused.

Uh. Yes.

>> We don't even necessarily need to log the hint bits for all items since
>> the redo for all_visible could make sure all items are hinted. The only
>> problem is knowing up to where we can truncate pg_clog...
>
> [all-visible cannot restore hint bits without FPI because of torn pages]

I haven't yet thought about this sufficiently yet. I think we might have a chance of working around this, let me ponder a bit.

But even if that means needing a full page write via the usual mechanism for all visible if any hint bits needed to be set we are still out far ahead of the current state imo.
* cleanup would quite possibly do an FPI shortly after in vacuum anyway. If we do it for all visible, it possibly does not need to be done for it.
* freezing would FPI almost guaranteedly since we do it so much later.
* Not having to rescan the whole heap will be a bigger cost saving...

Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Move unused buffers to freelist
On 5/14/13 8:42 AM, Amit Kapila wrote:
> In the attached patch, bgwriter/checkpointer moves unused (usage_count = 0
> && refcount = 0) buffers to the end of the freelist. I have implemented a
> new API StrategyMoveBufferToFreeListEnd() to move buffers to the end of
> the freelist.

Instead of a separate function, would it be better to add an argument to StrategyFreeBuffer?

ISTM this is similar to the other strategy stuff in the buffer manager, so perhaps it should mirror that...

-- Jim C. Nasby, Data Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
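Jim's suggestion, one entry point with a flag instead of a second function, might look roughly like this (a toy singly-linked free list with hypothetical names, not the actual freelist.c structures):

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for BufferDesc and the buffer free list. */
typedef struct BufSketch
{
    int         id;
    struct BufSketch *free_next;
} BufSketch;

typedef struct FreeListSketch
{
    BufSketch  *head;
    BufSketch  *tail;
} FreeListSketch;

/*
 * One entry point instead of two: at_head selects whether the buffer
 * goes to the front (the normal free path) or to the back (the
 * bgwriter/checkpointer handing over an unused buffer, per the patch
 * under discussion).
 */
static void
strategy_free_buffer(FreeListSketch *fl, BufSketch *buf, bool at_head)
{
    buf->free_next = NULL;
    if (fl->head == NULL)
        fl->head = fl->tail = buf;
    else if (at_head)
    {
        buf->free_next = fl->head;
        fl->head = buf;
    }
    else
    {
        fl->tail->free_next = buf;
        fl->tail = buf;
    }
}
```

Putting unused buffers at the tail keeps recently freed (and thus more likely still-hot) buffers at the head, which is presumably the point of distinguishing the two cases at all.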
Re: [HACKERS] Cost limited statements RFC
On 5/24/13 8:21 AM, Robert Haas wrote:
> On Thu, May 23, 2013 at 7:27 PM, Greg Smith wrote:
>> I'm working on a new project here that I wanted to announce, just to keep
>> from duplicating effort in this area. I've started to add a cost limit
>> delay for regular statements. The idea is that you set a new
>> statement_cost_delay setting before running something, and it will
>> restrict total resources the same way autovacuum does. I'll be happy with
>> it when it's good enough to throttle I/O on SELECT and CREATE INDEX
>> CONCURRENTLY.
>
> Cool. We have an outstanding customer request for this type of
> functionality; although in that case, I think the desire is more along
> the lines of being able to throttle writes rather than reads.
>
> But I wonder if we wouldn't be better off coming up with a little more
> user-friendly API. Instead of exposing a cost delay, a cost limit, and
> various charges, perhaps we should just provide limits measured in KB/s,
> like dirty_rate_limit and read_rate_limit. This is less powerful than
> what we currently offer for autovacuum, which allows you to come up with
> a "blended" measure of when vacuum has done too much work, but I don't
> have a lot of confidence that it's better in practice.

Doesn't that hit the old issue of not knowing if a read came from FS cache or disk? I realize that the current cost_delay mechanism suffers from that too, but since the API is lower level that restriction is much more apparent.

Instead of KB/s, could we look at how much time one process is spending waiting on IO vs the rest of the cluster? Is it reasonable for us to measure IO wait time for every request, at least on the most popular OSes?

-- Jim C. Nasby, Data Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
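The KB/s-style knob Robert describes reduces to a pacing computation: after processing some bytes, sleep long enough that the average rate stays at or below the limit. A hedged sketch of just that arithmetic (names hypothetical, not from any posted patch):

```c
/*
 * The arithmetic behind a read_rate_limit/dirty_rate_limit-style knob
 * (hypothetical names): after processing bytes_done bytes, return how
 * many microseconds to sleep so the average rate stays at or below
 * limit_kb_per_sec.  A real implementation would subtract the time
 * already spent doing the work and use 64-bit accumulators to avoid
 * overflow on large byte counts.
 */
static long
throttle_sleep_usec(long bytes_done, long limit_kb_per_sec)
{
    if (limit_kb_per_sec <= 0)
        return 0;               /* zero or negative: unlimited */
    return (bytes_done * 1000000L) / (limit_kb_per_sec * 1024L);
}
```

Jim's objection applies regardless of the formula: bytes counted this way conflate cache hits with real disk reads, which an IO-wait-time measure would not.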
Re: [HACKERS] background processes vs. hot standby
Hi,

On 2013-05-24 09:48:03 -0400, Robert Haas wrote:
> CheckRequiredParameterValues() has some code that, when hot standby is
> in use, checks the values of max_connections,
> max_prepared_transactions, and max_locks_per_transaction against the
> master. The comment says "we must have at least as many backend
> slots as the primary" ... but the code no longer enforces that,
> because we now compute MaxBackends like this:
>
> MaxBackends = MaxConnections + autovacuum_max_workers + 1 +
>     GetNumShmemAttachedBgworkers();
>
> If GetNumShmemAttachedBgworkers() returns a lower value on the standby
> than it did on the master, then we might well have fewer backend slots
> on the standby. I'm having trouble remembering why it's a problem to
> have fewer backend slots on the standby than the master, but if we
> need to prevent that then this code is no longer adequate to the task.

It's afair important because we need to allocate shared memory which can keep track of the maximum number of xids (toplevel * max_non_suboverflowed_subxids) in progress. That's the KnownAssignedXids* stuff in procarray.c.

> The comment doesn't explain why we check max_locks_per_transaction. I
> thought the reason for that check was that we needed to ensure that
> there were at least as many lock table slots on the standby as there
> were on the master, to prevent bad things from happening later. That
> was already not true, since the existing code didn't enforce any
> limitation on autovacuum_max_workers on the standby side. Maybe that
> doesn't matter, since autovacuum workers can't run in hot standby
> mode; not sure. But the addition of background workers to MaxBackends
> provides another way for that to be not true.

Yea, we need it exactly for that reason. I think it's unlikely to cause actual problems since we only ship access exclusive locks to the standby and it's hard to see scenarios where we have that many AELs on the primary.
But we probably should fix it anyway. I think fixing the autovacuum_max_workers case is entirely reasonable and relatively unlikely to cause problems. I don't think we can easily do it in a minor release though, since I don't see a way to transport knowledge about it via the WAL without either breaking the WAL format entirely or changing the meaning of MaxConnections in ControlFile, which would cause problems with upgrading the primary first. I am less excited about doing something similar for the background worker case. Requiring just as many background workers on the standby sounds like a bad idea to me; there are too many cases where that doesn't make sense. I wonder if we shouldn't make background workers use connection slots from max_connections, similar to how superuser_reserved_connections works. That would mean we don't need to care about it for HS. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] background processes vs. hot standby
CheckRequiredParameterValues() has some code that, when hot standby is in use, checks the values of max_connections, max_prepared_transactions, and max_locks_per_transaction against the master. The comment says "we must have at least as many backend slots as the primary" ... but the code no longer enforces that, because we now compute MaxBackends like this: MaxBackends = MaxConnections + autovacuum_max_workers + 1 + GetNumShmemAttachedBgworkers(); If GetNumShmemAttachedBgworkers() returns a lower value on the standby than it did on the master, then we might well have fewer backend slots on the standby. I'm having trouble remembering why it's a problem to have fewer backend slots on the standby than the master, but if we need to prevent that then this code is no longer adequate to the task. The comment doesn't explain why we check max_locks_per_transaction. I thought the reason for that check was that we needed to ensure that there were at least as many lock table slots on the standby as there were on the master, to prevent bad things from happening later. That was already not true, since the existing code didn't enforce any limitation on autovacuum_max_workers on the standby side. Maybe that doesn't matter, since autovacuum workers can't run in hot standby mode; not sure. But the addition of background workers to MaxBackends provides another way for that to be not true. Here's how we compute the size of the lock table: #define NLOCKENTS() \ mul_size(max_locks_per_xact, add_size(MaxBackends, max_prepared_xacts)) Thoughts? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Cost limited statements RFC
On Thu, May 23, 2013 at 7:27 PM, Greg Smith wrote: > I'm working on a new project here that I wanted to announce, just to keep > from duplicating effort in this area. I've started to add a cost limit > delay for regular statements. The idea is that you set a new > statement_cost_delay setting before running something, and it will restrict > total resources the same way autovacuum does. I'll be happy with it when > it's good enough to throttle I/O on SELECT and CREATE INDEX CONCURRENTLY. Cool. We have an outstanding customer request for this type of functionality; although in that case, I think the desire is more along the lines of being able to throttle writes rather than reads. But I wonder if we wouldn't be better off coming up with a little more user-friendly API. Instead of exposing a cost delay, a cost limit, and various charges, perhaps we should just provide limits measured in KB/s, like dirty_rate_limit and read_rate_limit. This is less powerful than what we currently offer for autovacuum, which allows you to come up with a "blended" measure of when vacuum has done too much work, but I don't have a lot of confidence that it's better in practice. > Modifying the buffer manager to account for statement-based cost > accumulation isn't difficult. The tricky part here is finding the right > spot to put the delay at. In the vacuum case, it's easy to insert a call to > check for a delay after every block of I/O. It should be possible to find a > single or small number of spots to put a delay check in the executor. But I > expect that every utility command may need to be modified individually to > find a useful delay point. This is starting to remind me of the SEPostgres > refactoring, because all of the per-command uniqueness ends up requiring a > lot of work to modify in a unified way. I haven't looked at this in detail, but I would hope it's not that bad. 
For one thing, many DDL commands don't do any significant I/O in the first place and so can probably be disregarded. Those that do are mostly things that rewrite the table and things that build indexes. I doubt there are more than 3 or 4 code paths to patch. > The main unintended consequences issue I've found so far is when a cost > delayed statement holds a heavy lock. Autovacuum has some protection > against letting processes with an exclusive lock on a table go to sleep. It > won't be easy to do that with arbitrary statements. There's a certain > amount of allowing the user to shoot themselves in the foot here that will > be time consuming (if not impossible) to eliminate. The person who runs an > exclusive CLUSTER that's limited by statement_cost_delay may suffer from > holding the lock too long. But that might be their intention with setting > the value. Hard to idiot proof this without eliminating useful options too. Well, we *could* have a system where, if someone blocks waiting for a lock held by a rate-limited process, the rate limits are raised or abolished. But I'm pretty sure that's a bad idea. I think that the people who want rate limits want them because allowing too much write (or maybe read?) activity hoses the performance of the entire system, and that's not going to be any less true if there are multiple jobs piling up. Let's say someone has a giant COPY into a huge table, and CLUSTER blocks behind it, waiting for AccessExclusiveLock. Well... making the COPY run faster so that we can hurry up and start CLUSTER-ing seems pretty clearly wrong. We want the COPY to run slower, and we want the CLUSTER to run slower, too. If we don't want that, then, as you say, we shouldn't set the GUC in the first place. Long story short, I'm inclined to define this as expected behavior. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL segments (names) not in a sequence
> Maybe I didn't explain correctly. I am using COPY/pg_dump/pg_restore for > migration (and it is working fine). The streaming replication is for > hot-standby replication *once migrated*. Thing is I disable archiving and > set wal_level to minimal, when migrating the large portion of data, to make > it faster. Then I switch to wal_level=hot_standby, i.e. the "production" > configuration, and the WAL segment sequence seems to overlap with the > segments generated with the other setting. > Though, now you understand it's not what it looks like, right? :-) -- Amit Langote -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Patch to .gitignore
Fabrízio de Royes Mello writes: > On Fri, May 24, 2013 at 12:04 AM, Christopher Browne > wrote: >> There hasn't been general agreement on the merits of particular .gitignore >> rules of this sort. > I agree with you about vim-oriented patterns, because it's a particular > tool, but "ctags" and "etags" are part of the postgres source tree and > generate some output inside it, so I think we must ignore it. [ shrug... ] Editor backup files get generated inside the source tree as well. Chris stated the policy accurately: if you use tools that leave unexpected files in the source tree, it's up to you to have a personal .gitignore for those, assuming you want them ignored. The project's .gitignore files are only supposed to list files that get generated by the standard build processes. FWIW, my personal .gitexclude file looks like *~ *.orig where the latter one is for junk generated by "patch". I believe patch's extension for saved files varies across versions, so it wouldn't be too sensible to have an exclusion like that in the project-wide file. Note also that I intentionally *don't* have an exclusion for *.rej --- if any patch hunk failed, I want git to mention it. But that's a matter of personal preference. I rather imagine that other people configure it differently, and that's fine. As long as we don't try to put such things in the project-wide exclusion list, we don't have to have a consensus about it. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL segments (names) not in a sequence
Hi Sergey, Maybe I didn't explain correctly. I am using COPY/pg_dump/pg_restore for migration (and it is working fine). The streaming replication is for hot-standby replication *once migrated*. Thing is I disable archiving and set wal_level to minimal, when migrating the large portion of data, to make it faster. Then I switch to wal_level=hot_standby, i.e. the "production" configuration, and the WAL segment sequence seems to overlap with the segments generated with the other setting. On Thu, May 23, 2013 at 7:44 PM, Sergey Konoplev wrote: > On Thu, May 23, 2013 at 6:18 AM, German Becker > wrote: > > Let me describe the process I follow to get to this. What I am doing is > > testing a migration from 8.3 to 9.1. The way I plan to do it is the > > following. > > 1) Create the schema > > 2) import the biggest tables, which are not updated, only growing, with > COPY > > (this is about 35gb of data) > > 3) import the small, changing part of the data > > > > > > The target system is 9.1 with streaming replication. > > For steps 1 and 2, I set a "restore" configuration that, amongst other > things > > like more work_mem, sets archive_mode=off and wal_level=minimal > (attached > > the difference between restore and normal). > > The archive_command is just a cp wrapped in a shell script in case I > need to > > change it. > You can not migrate between any major versions with WAL based or > streaming replication. > > Use either full dump/restore or schema only dump/restore plus trigger > based replication (londiste, slony) to migrate data. > > -- > Kind regards, > Sergey Konoplev > PostgreSQL Consultant and DBA > > Profile: http://www.linkedin.com/in/grayhemp > Phone: USA +1 (415) 867-9984, Russia +7 (901) 903-0499, +7 (988) 888-1979 > Skype: gray-hemp > Jabber: gray...@gmail.com >
Re: [HACKERS] Removal of pageinspect--1.0.sql
On 20.05.2013 19:50, Michael Paquier wrote: The contrib module pageinspect has been upgraded to 1.1, but pageinspect--1.0.sql is still present in source code. Shouldn't it be removed? Please find patch attached. Yep. Removed, thanks. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WARNING : pgstat wait timeout - Postgres 9.1
About the stats_temp_directory, I didn't run as root... Now I'm sure the configurations are correct. I think I have too much IO to use stats. I will always have this message... Maybe I can disable this option. Do you know what impact that would really have? Thanks. Math 2013/5/24 Mathieu Guerin > Hello, > > Thanks a lot for your answers. > > > > You should get it... > > stats_temp_directory | > pg_stat_tmp | Writes temporary > statistics files to the specified directory. > > I don't know why I don't get it. I am on version 9.1... > > Moreover, when I mount pg_stat_tmp in a tmpfs, the warning messages > decrease from one per minute to one every five > seconds. I don't have any other warning logs, but the file pg_stat.stat in > the mounting point is not created... I tried before on a test environment > and it works... > > If you have any ideas... > Thanks a lot. > Regards, > Math > > > > 2013/5/24 Michael Paquier > >> >> >> >> On Thu, May 23, 2013 at 9:31 PM, Mathieu Guerin < >> mathieu.gueri...@gmail.com> wrote: >> >>> What are the consequences? Because this file will be removed if the >>> server reboots. >>> >> Those temporary statistics are stored in the global directory when the server >> shuts down, so the risk here would be to lose a portion of this data in the >> case of a crash, either at the PG or at the OS level. >> >> >>> If we change the parameter stats_temp_directory, is it necessary to >>> reboot the server? >>> >> No, sending SIGHUP to the server is enough. >> >> >>> When I launch a SHOW ALL; command, the parameter stats_temp_directory is >>> not here. >>> >> You should get it... >> stats_temp_directory | >> pg_stat_tmp | Writes temporary >> statistics files to the specified directory. >> -- >> Michael >> > >
Re: [HACKERS] WARNING : pgstat wait timeout - Postgres 9.1
Hello, Thanks a lot for your answers. > You should get it... > stats_temp_directory | pg_stat_tmp | Writes temporary statistics files to the specified directory. I don't know why I don't get it. I am on version 9.1... Moreover, when I mount pg_stat_tmp in a tmpfs, the warning messages decrease from one per minute to one every five seconds. I don't have any other warning logs, but the file pg_stat.stat in the mounting point is not created... I tried before on a test environment and it works... If you have any ideas... Thanks a lot. Regards, Math 2013/5/24 Michael Paquier > > > > On Thu, May 23, 2013 at 9:31 PM, Mathieu Guerin < > mathieu.gueri...@gmail.com> wrote: > >> What are the consequences? Because this file will be removed if the >> server reboots. >> > Those temporary statistics are stored in the global directory when the server > shuts down, so the risk here would be to lose a portion of this data in the > case of a crash, either at the PG or at the OS level. > > >> If we change the parameter stats_temp_directory, is it necessary to >> reboot the server? >> > No, sending SIGHUP to the server is enough. > > >> When I launch a SHOW ALL; command, the parameter stats_temp_directory is >> not here. >> > You should get it... > stats_temp_directory | > pg_stat_tmp | Writes temporary > statistics files to the specified directory. > -- > Michael >
Re: [HACKERS] Patch to .gitignore
> "ctags" and "etags" are part of the postgres source tree and generate some output inside it, so I think we must ignore it. +1 Regards, Amul Sul -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers