Re: [HACKERS] pg_stat_lwlock wait time view
2016-08-25 13:46 GMT+09:00 Haribabu Kommi <kommi.harib...@gmail.com>:
> On Thu, Aug 25, 2016 at 6:57 AM, Robert Haas <robertmh...@gmail.com> wrote:
>> On Wed, Aug 24, 2016 at 4:23 AM, Haribabu Kommi
>> <kommi.harib...@gmail.com> wrote:
>>>
>>> Based on the performance impact with the additional timing calculations,
>>> we can decide the view default behavior. Are there any objections to the
>>> concept?
>>
>> There have been some other recent threads on extending the wait event
>> stuff. If you haven't already read those, you should, because the
>> issues are closely related. I think that timing LWLock waits will be
>> quite expensive. I believe that what the Postgres Pro folks want to
>> do is add up the wait times or maybe keep a history of waits (though
>> maybe I'm wrong about that), but showing them in pg_stat_activity is
>> another idea. That's likely to add some synchronization overhead
>> which might be even greater in this case than for a feature that just
>> publishes accumulated times, but maybe it's all a drop in the bucket
>> compared to the cost of calling gettimeofday() in the first place.
>
> Yes, I agree this is an issue for the cases where the wait time is smaller
> than the cost of the logic that is added to calculate the wait time. Even if
> we use clock_gettime() with CLOCK_REALTIME_COARSE there will be some
> overhead; that clock method is about 8 times faster than gettimeofday()
> but less accurate. Maybe we can use clock_gettime() instead of
> gettimeofday() in this case, as we may not need the fine-grained value.

Is there any other option (rather than gettimeofday()) to measure elapsed time with lower overhead? I've heard about the RDTSC feature (hardware counter) supported by recent processors, and have found a few articles [1] [2] on its lower overhead compared to gettimeofday().
[1] http://stackoverflow.com/questions/15623343/using-cpu-counters-versus-gettimeofday
[2] http://stackoverflow.com/questions/6498972/faster-equivalent-of-gettimeofday

I'm not sure how we can benefit from it so far, because I'm not familiar with this facility and of course I don't have the numbers. In addition to that, I guess it would bring some portability issues. But I'm still curious to learn more about it. Does anyone have experience with it?

>> Personally, my preferred solution is still to have a background worker
>> that samples the published wait events and rolls up statistics, but
>> I'm not sure I've convinced anyone else. It could report the number
>> of seconds since it detected a wait event other than the current one,
>> which is not precisely the same thing as tracking the length of the
>> current wait but it's pretty close. I don't know for sure what's best
>> here - I think some experimentation and dialog is needed.
>
> Yes, using a background worker can reduce the load of adding all the
> wait time calculations in the main backend. I can give it a try by modifying
> the direct calculation approach and a background worker (maybe pg_stat_collector)
> to find the wait time based on the stat messages that are received from the
> main backend related to wait start and wait end.
>
> I am not sure, without getting any signal or message from the main backend,
> how accurate the data gathered from a background worker can be.

It looks like an accuracy-performance trade-off. So, I think the use cases matter here to get a better design. I guess that's the reason why Ilya is looking for feature requests from DBAs in another thread [3].

[3] https://www.postgresql.org/message-id/cag95seuaqvj09kzlwu%2bz1b-gqdmqerzekpfr3hn0q88xzmq...@mail.gmail.com

Regards,
--
Satoshi Nagayasu <sn...@uptime.jp>

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Wait events monitoring future development
2016/08/10 23:22 "Bruce Momjian":
> On Wed, Aug 10, 2016 at 05:14:52PM +0300, Alexander Korotkov wrote:
> > On Tue, Aug 9, 2016 at 5:37 AM, Bruce Momjian wrote:
> >
> >     On Tue, Aug 9, 2016 at 02:06:40AM +, Tsunakawa, Takayuki wrote:
> >     > I hope wait event monitoring will be on by default even if the overhead is not
> >     > almost zero, because the data needs to be readily available for faster
> >     > troubleshooting. IMO, the benefit would be worth even 10% overhead. If you
> >     > disable it by default because of overhead, how can we convince users to enable
> >     > it in production systems to solve some performance problem? I'm afraid severe
> >     > users would say "we can't change any setting that might cause more trouble, so
> >     > investigate the cause with existing information."
> >
> >     If you want to know why people are against enabling this monitoring by
> >     default, above is the reason. What percentage of people do you think
> >     would be willing to take a 10% performance penalty for monitoring like
> >     this? I would bet very few, but the argument above doesn't seem to
> >     address the fact it is a small percentage.
> >
> > Just two notes from me:
> >
> > 1) 10% overhead from monitoring wait events is just an idea without any proof so far.
> > 2) We already have functionality which trades insight into the database for way
> > more overhead. auto_explain.log_analyze = true can slow down queries *several
> > times over*. Do you think we should remove it?
>
> The point is not removing it, the point is whether
> auto_explain.log_analyze = true should be enabled by default, and I
> think no one wants to do that.

Agreed. If people are facing some difficult situation in terms of performance, they may accept some (one-time) overhead to resolve the issue. But if they don't have (or don't recognize) any issue, they may not. That's one of the realities in my experience.

Regards,
Re: [HACKERS] Wait events monitoring future development
2016-08-07 21:03 GMT+09:00 Ilya Kosmodemiansky <ilya.kosmodemian...@postgresql-consulting.com>:
> I've summarized the Wait events monitoring discussion at the Developer
> unconference in Ottawa this year on the wiki:
>
> https://wiki.postgresql.org/wiki/PgCon_2016_Developer_Unconference/Wait_events_monitoring
>
> (Thanks to Alexander Korotkov for patiently pushing me to make this thing
> finally done)

Thanks for your effort to help us move forward.

> If you attended, feel free to point out anything I missed, and I will put
> it on the wiki too.
>
> Wait event monitoring looks once again stuck on its way through community
> approval, in spite of the huge progress made last year in that direction. The
> importance of the topic is beyond discussion now; if you talk to any
> PostgreSQL person about implementing such a tool in Postgres and the
> person does not get excited, you are probably talking to a full-time PostgreSQL
> developer ;-) Obviously it needs a better design, both the user interface and
> the implementation, and perhaps this is why full-time developers are still
> sceptical.
>
> In order to move forward, IMHO we need at least the following steps, which can
> be done in parallel:
>
> 1. Further requirements need to be collected from DBAs.
>
>    If you are a PostgreSQL DBA with Oracle experience and use perf for
> troubleshooting Postgres, you are an ideal person to share your experience,
> but everyone is welcome.
>
> 2. Further pg_wait_sampling performance testing is needed, in different
> environments.
>
>    According to the developers, the overhead is small, but many people have
> doubts that it can be much more significant for intensive workloads. Obviously,
> it is not an easy task to test, because you need to put doubtfully
> non-production-ready code into mission-critical production for such tests.
>    As a result it will become clear whether this design should be abandoned
> and we need to think about less-invasive solutions, or this design is
> acceptable.
>
> Any thoughts?
Seems like a good starting point. I'm interested in both, and I would like to contribute by running (or writing) several tests.

Regards,
--
Satoshi Nagayasu <sn...@uptime.jp>
Re: [HACKERS] Wait events monitoring future development
2016-08-09 11:49 GMT+09:00 Joshua D. Drake <j...@commandprompt.com>:
> On 08/08/2016 07:37 PM, Bruce Momjian wrote:
>>
>> On Tue, Aug 9, 2016 at 02:06:40AM +, Tsunakawa, Takayuki wrote:
>>>
>>> I hope wait event monitoring will be on by default even if the overhead
>>> is not almost zero, because the data needs to be readily available for
>>> faster troubleshooting. IMO, the benefit would be worth even 10% overhead.
>>> If you disable it by default because of overhead, how can we convince
>>> users to enable it in production systems to solve some performance
>>> problem? I'm afraid severe users would say "we can't change any setting
>>> that might cause more trouble, so investigate the cause with existing
>>> information."
>>
>> If you want to know why people are against enabling this monitoring by
>> default, above is the reason. What percentage of people do you think
>> would be willing to take a 10% performance penalty for monitoring like
>> this? I would bet very few, but the argument above doesn't seem to
>> address the fact it is a small percentage.
>
> I would argue it is zero. There are definitely users for this feature but to
> enable it by default is looking for trouble. *MOST* users do not need this.

I used to think that this kind of feature should be enabled by default, because when I was working at my previous company, I had only a few ways to understand what was happening inside PostgreSQL by observing production databases. I needed those features enabled in the production databases when I was called in.

However, now I have another opinion. When we release the next major version, say 10.0, with wait monitoring, many people will start their benchmark tests with a configuration using *the default values*, and if they see some performance decrease, for example around 10%, they will talk about it as a performance decrease in PostgreSQL 10.0. That means PostgreSQL would be facing a difficult reputation.
So, I agree that the feature should be disabled by default for a while.

Regards,
--
Satoshi Nagayasu <sn...@uptime.jp>
Re: [HACKERS] DROP OWNED BY ... CASCADE & "could not open relation with OID" error
2016-07-21 13:53 GMT+09:00 Alvaro Herrera <alvhe...@2ndquadrant.com>:
> Satoshi Nagayasu wrote:
>> Hi,
>>
>> I have been trying MADlib [1], a machine-learning library for PostgreSQL,
>> and when I was trying it on 9.5 and 9.6beta2, I often got the following
>> error on my box.
>>
>> madpack.py : ERROR : SQL command failed:
>> SQL: DROP OWNED BY madlib_19_installcheck CASCADE;
>> ERROR: could not open relation with OID 147056
>>
>> I wonder whether this could be a PostgreSQL bug.
>> AFAIK, this kind of error should not occur during regular SQL command
>> execution, so I guess this is a PostgreSQL bug.
>
> Agreed.
>
>> Should I dig into it deeper to reproduce it in a simple procedure?
>
> Yes, please.

OK. I will try it.

Regards,
--
Satoshi Nagayasu <sn...@uptime.jp>
[HACKERS] DROP OWNED BY ... CASCADE & "could not open relation with OID" error
Hi,

I have been trying MADlib [1], a machine-learning library for PostgreSQL, and when I was trying it on 9.5 and 9.6beta2, I often got the following error on my box.

madpack.py : ERROR : SQL command failed:
SQL: DROP OWNED BY madlib_19_installcheck CASCADE;
ERROR: could not open relation with OID 147056

I wonder whether this could be a PostgreSQL bug. AFAIK, this kind of error should not occur during regular SQL command execution, so I guess this is a PostgreSQL bug.

Should I dig into it deeper to reproduce it in a simple procedure?

Regards,

[1] http://madlib.incubator.apache.org/
--
Satoshi Nagayasu <sn...@uptime.jp>
[HACKERS] PARALLEL SAFE/UNSAFE question
Hi all,

I was trying to write my own parallel aggregation functions on 9.6beta2. During that, we found a somewhat confusing behavior with the SAFE/UNSAFE options.

Once a PARALLEL UNSAFE function f1_unsafe() is wrapped by a PARALLEL SAFE function f1_safe_unsafe(), calling f1_safe_unsafe() produces a parallel execution plan even though it implicitly calls the UNSAFE function f1_unsafe(). Is this intentional?

Please take a look at our example here:

https://gist.github.com/snaga/362a965683fb2581bc693991b1fcf721

According to the manual [1], it is described as:

> the presence of such a function in an SQL statement forces a serial execution plan.

So, I expected that calling UNSAFE functions would produce a non-parallel execution plan even when wrapped in SAFE functions.

Is this a sort of limitation? Or is this working correctly as expected?

Regards,

[1] https://www.postgresql.org/docs/9.6/static/sql-createfunction.html
--
Satoshi Nagayasu <sn...@uptime.jp>
Re: [HACKERS] Actual row count of Parallel Seq Scan in EXPLAIN ANALYZE
Hi,

2016-06-20 15:42 GMT+09:00 Amit Kapila <amit.kapil...@gmail.com>:
> On Mon, Jun 20, 2016 at 11:48 AM, Masahiko Sawada <sawada.m...@gmail.com> wrote:
>>
>> Hi all,
>>
>> My colleague noticed that the output of EXPLAIN ANALYZE doesn't work
>> fine for parallel seq scan.
>>
>> postgres(1)=# explain analyze verbose select count(*) from pgbench_accounts ;
>> QUERY PLAN
>> -
>> Finalize Aggregate (cost=217018.55..217018.56 rows=1 width=8) (actual time=2640.015..2640.015 rows=1 loops=1)
>>    Output: count(*)
>>    -> Gather (cost=217018.33..217018.54 rows=2 width=8) (actual time=2639.064..2640.002 rows=3 loops=1)
>>          Output: (PARTIAL count(*))
>>          Workers Planned: 2
>>          Workers Launched: 2
>>          -> Partial Aggregate (cost=216018.33..216018.34 rows=1 width=8) (actual time=2632.714..2632.715 rows=1 loops=3)
>>                Output: PARTIAL count(*)
>>                Worker 0: actual time=2632.583..2632.584 rows=1 loops=1
>>                Worker 1: actual time=2627.517..2627.517 rows=1 loops=1
>>                -> Parallel Seq Scan on public.pgbench_accounts (cost=0.00..205601.67 rows=417 width=0) (actual time=0.042..1685.542 rows=333 loops=3)
>>                      Worker 0: actual time=0.033..1657.486 rows=3457968 loops=1
>>                      Worker 1: actual time=0.039..1702.979 rows=3741069 loops=1
>> Planning time: 1.026 ms
>> Execution time: 2640.225 ms
>> (15 rows)
>>
>> For example, the above result shows,
>> Parallel Seq Scan : actual rows = 333
>> worker 0 : actual rows = 3457968
>> worker 1 : actual rows = 3741069
>> Summation of these is 10532370, but actual total rows is 1000.
>> I think that Parallel Seq Scan should show actual rows =
>> 1000 (total rows) or actual rows = 2800963 (rows collected by
>> itself). (1000 maybe better)
>
> You have to read the rows at Parallel Seq Scan nodes as the total count of
> rows, but you have to consider the loops parameter as well.

IMHO, "actual rows" of "Parallel Seq Scan" should not be divided by the loops, because the total rows processed here is 1000, not 333 * 3.
I think the actual row value shown here, "333", is a bit confusing and tricky for users.

>> After spending time investigating this behaviour, ISTM that the problem
>> is nloops of Parallel Seq Scan.
>> Parallel Seq Scan is done only once, but nloops is incremented to 3.
>
> nloops here indicates that it is done by 2 workers and the master backend.

Right, but I'm not sure "loops" is an appropriate word for this context.

>> So its "actual rows" is calculated as 333 (1000 / 3) at
>> explain.c:L1223.
>
> That's how it should be considered. You might want to compare the behaviour
> with other cases where the value of nloops is used.
>
>> Is it a bug?
>
> I don't think so.

I would like to propose a few modifications for parallel queries to make it more clear.

>>    -> Parallel Seq Scan on public.pgbench_accounts
>>       (cost=0.00..205601.67 rows=417 width=0) (actual time=0.042..1685.542 rows=1000 workers=2)
>>          Parent: actual time=0.033..1657.486 rows=2800963 loops=1
>>          Worker 0: actual time=0.033..1657.486 rows=3457968 loops=1
>>          Worker 1: actual time=0.039..1702.979 rows=3741069 loops=1

How about this?

Regards,
--
Satoshi Nagayasu <sn...@uptime.jp>
Re: [HACKERS] pg_stat_statements query jumbling question
On 2015/10/03 6:18, Peter Geoghegan wrote:
> On Wed, Sep 2, 2015 at 7:41 PM, Satoshi Nagayasu <sn...@uptime.jp> wrote:
>> I know this still needs to be discussed, but I would like to submit a
>> patch for further discussion at the next CF, 2015-11.
>
> I think I already expressed this in my explanation of the current
> behavior, but to be clear: -1 from me to this proposal.

I would like to introduce queryId to the pgaudit and sql_firewall extensions to determine query groups. queryId could be useful if it were available in those modules.

I think users may want to do that based on object names, because they issue queries with object names, not internal object IDs. Changing the queryId after re-creating the same table may confuse users, because the query string they issue has not changed.

At least, I would like to give users some options to choose from. Is that possible and/or reasonable?

Regards,
--
NAGAYASU Satoshi <sn...@uptime.jp>
Re: [HACKERS] pg_stat_statements query jumbling question
On 2015/09/01 14:39, Satoshi Nagayasu wrote: > On 2015/09/01 14:01, Tom Lane wrote: >> Satoshi Nagayasu <sn...@uptime.jp> writes: >>> On 2015/09/01 13:41, Peter Geoghegan wrote: >>>> If you want to use the queryId field directly, which I recall you >>>> mentioning before, then that's harder. There is simply no contract >>>> among extensions for "owning" a queryId. But when the fingerprinting >>>> code is moved into core, then I think at that point queryId may cease >>>> to be even a thing that pg_stat_statements theoretically has the right >>>> to write into. Rather, it just asks the core system to do the >>>> fingerprinting, and finds it within queryId. At the same time, other >>>> extensions may do the same, and don't need to care about each other. >>>> >>>> Does that work for you? >> >>> Yes. I think so. >> >>> I need some query fingerprint to determine query group. I want queryid >>> to keep the same value when query strings are the same (except literal >>> values). >> >> The problem I've got with this is the unquestioned assumption that every >> application for query IDs will have exactly the same requirements for >> what the ID should include or ignore. > > I'm not confident about that too, but at least, I think we will be able > to collect most common use cases as of today. (aka best guess. :) > > And IMHO it would be ok to change the spec in future release. I know this still needs to be discussed, but I would like to submit a patch for further discussion at the next CF, 2015-11. 
Regards,
--
NAGAYASU Satoshi <sn...@uptime.jp>

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 59b8a2e..2f6fb99 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -76,6 +76,8 @@
 #include "tcop/utility.h"
 #include "utils/builtins.h"
 #include "utils/memutils.h"
+#include "utils/rel.h"
+#include "utils/relcache.h"
 
 PG_MODULE_MAGIC;
@@ -2286,6 +2288,7 @@ static void
 JumbleRangeTable(pgssJumbleState *jstate, List *rtable)
 {
 	ListCell   *lc;
+	Relation	rel;
 
 	foreach(lc, rtable)
 	{
@@ -2296,7 +2299,9 @@ JumbleRangeTable(pgssJumbleState *jstate, List *rtable)
 		switch (rte->rtekind)
 		{
 			case RTE_RELATION:
-				APP_JUMB(rte->relid);
+				rel = RelationIdGetRelation(rte->relid);
+				APP_JUMB_STRING(RelationGetRelationName(rel));
+				RelationClose(rel);
 				JumbleExpr(jstate, (Node *) rte->tablesample);
 				break;
 			case RTE_SUBQUERY:
[HACKERS] pg_stat_statements query jumbling question
Hi,

I have a question on jumbling queries in pg_stat_statements.

I found that JumbleRangeTable() uses the relation oid in RangeTblEntry. Obviously, this results in a different queryid when the table gets re-created (dropped and created).

Why don't we use the relation name (looking it up in the catalog) for query jumbling? For performance reasons?

Regards,
--
NAGAYASU Satoshi
Re: [HACKERS] pg_stat_statements query jumbling question
On 2015/09/01 13:41, Peter Geoghegan wrote:
> On Mon, Aug 31, 2015 at 9:29 PM, Satoshi Nagayasu <sn...@uptime.jp> wrote:
>> BTW, I'm interested in improving the queryid portability now because
>> I'd like to use it in other extensions. :) That's the reason why I'm
>> looking at query jumbling here.
>
> Are you interested in having the query fingerprinting/jumbling
> infrastructure available to all backend code? That seems like a good
> idea to me generally.

Yes. I've been working on the sql_firewall extension [1], which is totally built on top of pg_stat_statements.

[1] http://pgsnaga.blogspot.jp/2015/08/postgresql-sql-firewall.html

As of today, sql_firewall has duplicated code for query jumbling. But if it goes into the core, that would be fantastic.

> I would like to be able to put queryId in log_line_prefix, or to
> display it within EXPLAIN, and have it available everywhere. I like
> the idea of per-query log_min_duration_statement settings.

Sounds cool. :)

> If you want to use the queryId field directly, which I recall you
> mentioning before, then that's harder. There is simply no contract
> among extensions for "owning" a queryId. But when the fingerprinting
> code is moved into core, then I think at that point queryId may cease
> to be even a thing that pg_stat_statements theoretically has the right
> to write into. Rather, it just asks the core system to do the
> fingerprinting, and finds it within queryId. At the same time, other
> extensions may do the same, and don't need to care about each other.
>
> Does that work for you?

Yes. I think so.

I need some query fingerprint to determine query groups. I want queryid to keep the same value when query strings are the same (except for literal values). Another reason is just that I need to import/export query ids.

Regards,
--
NAGAYASU Satoshi <sn...@uptime.jp>
Re: [HACKERS] pg_stat_statements query jumbling question
On 2015/09/01 14:01, Tom Lane wrote:
> Satoshi Nagayasu <sn...@uptime.jp> writes:
>> On 2015/09/01 13:41, Peter Geoghegan wrote:
>>> If you want to use the queryId field directly, which I recall you
>>> mentioning before, then that's harder. There is simply no contract
>>> among extensions for "owning" a queryId. But when the fingerprinting
>>> code is moved into core, then I think at that point queryId may cease
>>> to be even a thing that pg_stat_statements theoretically has the right
>>> to write into. Rather, it just asks the core system to do the
>>> fingerprinting, and finds it within queryId. At the same time, other
>>> extensions may do the same, and don't need to care about each other.
>>>
>>> Does that work for you?
>
>> Yes. I think so.
>
>> I need some query fingerprint to determine query group. I want queryid
>> to keep the same value when query strings are the same (except literal
>> values).
>
> The problem I've got with this is the unquestioned assumption that every
> application for query IDs will have exactly the same requirements for
> what the ID should include or ignore.

I'm not confident about that too, but at least, I think we will be able to collect most common use cases as of today. (aka best guess. :)

And IMHO it would be OK to change the spec in a future release.

Regards,
--
NAGAYASU Satoshi <sn...@uptime.jp>
Re: [HACKERS] pg_stat_statements query jumbling question
On 2015/09/01 12:36, Peter Geoghegan wrote:
> On Mon, Aug 31, 2015 at 8:32 PM, Satoshi Nagayasu <sn...@uptime.jp> wrote:
>> Why don't we use relation name (with looking up the catalog) on query
>> jumbling? For performance reason?
>
> I think that there is a good case for preferring this behavior. While
> it is a little confusing that pg_stat_statements does not change the
> representative query string, renaming a table does not make it a
> substantively different table.
>
> There is, IIRC, one case where a string is jumbled directly (CTE
> name). It's usually not the right thing, IMV.

Thanks for the comment. I've never considered that. Interesting.

From the user's point of view, IMHO, it would avoid confusion if the queryid were determined only by visible values: userid, dbid and the query string itself.

BTW, I'm interested in improving the queryid portability now because I'd like to use it in other extensions. :) That's the reason why I'm looking at query jumbling here.

Thoughts?

Regards,
--
NAGAYASU Satoshi <sn...@uptime.jp>
Re: [HACKERS] Assert in pg_stat_statements
2015-08-10 0:04 GMT+09:00 Robert Haas <robertmh...@gmail.com>:
> On Sun, Aug 9, 2015 at 1:36 AM, Satoshi Nagayasu <sn...@uptime.jp> wrote:
>> On 2015/08/08 22:32, Robert Haas wrote:
>>> On Sat, Aug 8, 2015 at 5:00 AM, Satoshi Nagayasu <sn...@uptime.jp> wrote:
>>>> I just found that pg_stat_statements causes an assert when queryId is
>>>> set by another module, which is loaded prior to pg_stat_statements in
>>>> the shared_preload_libraries parameter.
>>>>
>>>> Theoretically, queryId in the shared memory could be set by another
>>>> module by design. So, IMHO, pg_stat_statements should not cause an
>>>> assert even if queryId already has a non-zero value --- my module has
>>>> this problem now.
>>>>
>>>> Is my understanding correct? Any comments?
>>>
>>> Seems like an ERROR would be more appropriate than an Assert.
>>
>> Well, it's not that simple, and I'm afraid that it may not fix the issue.
>>
>> Here, let's assume that we have two modules (foo and pg_stat_statements)
>> which calculate, use and store Query->queryId independently. What we may
>> see when the two are enabled is the following.
>>
>> (1) Enable the two modules in shared_preload_libraries.
>>
>>     shared_preload_libraries = 'foo,pg_stat_statements'
>>
>> (2) Once a query comes in, a callback function in module foo associated
>> with post_parse_analyze_hook calculates and stores queryId in
>> Query->queryId.
>>
>> (3) After that, a callback function (pgss_post_parse_analyze) in
>> pg_stat_statements associated with post_parse_analyze_hook detects that
>> Query->queryId has a non-zero value, and asserts on it.
>>
>> As a result, we can not have two or more modules that use queryId at the
>> same time. But I think it should be possible, because one of the
>> advantages of PostgreSQL is its extensibility.
>>
>> So, I think the problem here is the fact that pg_stat_statements treats
>> a non-zero value in queryId as an error even if it is exactly the same
>> value the other module computed.
>
> I'm not too excited about supporting the use case where there are two
> people using queryId but it just so happens that they always set exactly
> the same value. That seems like a weird setup.
I think so too, but unfortunately, I have to do that to work with 9.4 and 9.5.

> Wouldn't that mean both modules were applying the same jumble algorithm,
> and therefore you're doing the work of computing the jumble twice per
> query?

Basically yes, because both modules should work not only together, but also solely. So, my module needs to have the capability to calculate queryId itself.

More precisely, if we have the following configuration,

    shared_preload_libraries = 'foo,pg_stat_statements'

my module foo calculates queryId itself, and pg_stat_statements would do the same (and crashes when built with --enable-cassert).

If we have the following configuration,

    shared_preload_libraries = 'pg_stat_statements,foo'

my module foo can skip the queryId calculation because pg_stat_statements has already done it.

Of course, my module foo should be able to work solely, as follows.

    shared_preload_libraries = 'foo'

> It would be better to have the second module have an option not to
> compute the query ID and just do whatever else it does. Then, if you
> want to use that other module with pg_stat_statements, flip the GUC.

Yes, that's exactly what I'm doing in my module, and what I intended in my first patch for pg_stat_statements on this thread. :)

Regards,
--
NAGAYASU Satoshi <sn...@uptime.jp>
Re: [HACKERS] Assert in pg_stat_statements
2015-08-10 2:23 GMT+09:00 Tom Lane <t...@sss.pgh.pa.us>:
> Robert Haas <robertmh...@gmail.com> writes:
>> I'm not too excited about supporting the use case where there are two
>> people using queryId but it just so happens that they always set exactly
>> the same value. That seems like a weird setup. Wouldn't that mean both
>> modules were applying the same jumble algorithm, and therefore you're
>> doing the work of computing the jumble twice per query?
>
> Not only that, but you'd need to keep the two modules in sync, which
> would be one of the greatest recipes for bugs we've thought of yet.
>
> If there's actually a use case for that sort of thing, I would vote for
> moving the jumble-calculation code into core and making both of the
> interested extensions do
>
>     /* Calculate query hash if it's not been done yet */
>     if (query->queryId == 0)
>         calculate_query_hash(query);

I vote for this too. Jumble calculation is the smartest way to identify query groups, so several modules can take advantage of it if it is in the core.

Regards,
--
Satoshi Nagayasu <sn...@uptime.jp>
[HACKERS] Assert in pg_stat_statements
Hi,

I just found that pg_stat_statements causes an assert when queryId is set by another module, which is loaded prior to pg_stat_statements in the shared_preload_libraries parameter.

Theoretically, queryId in the shared memory could be set by another module by design. So, IMHO, pg_stat_statements should not cause an assert even if queryId already has a non-zero value --- my module has this problem now.

Is my understanding correct? Any comments?

Regards,
--
NAGAYASU Satoshi <sn...@uptime.jp>

commit b975d7c2fe1b36a3ded1e0960be191466704e0fc
Author: Satoshi Nagayasu <sn...@uptime.jp>
Date:   Sat Aug 8 08:51:45 2015 +

    Fix pg_stat_statements to avoid assert failure when queryId already
    has a non-zero value.

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 59b8a2e..84c5200 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -776,8 +776,9 @@ pgss_post_parse_analyze(ParseState *pstate, Query *query)
 	if (prev_post_parse_analyze_hook)
 		prev_post_parse_analyze_hook(pstate, query);
 
-	/* Assert we didn't do this already */
-	Assert(query->queryId == 0);
+	/* Assume that other module has calculated and set queryId */
+	if (query->queryId > 0)
+		return;
 
 	/* Safety check... */
 	if (!pgss || !pgss_hash)
[HACKERS] pg_filedump patch for 9.5
Hi, I have created a patch for pg_filedump to work with 9.5. Here is a list of changes.

* Fix to rename CRC32 macros to work with 9.5.
* Fix to add missing DBState: DB_SHUTDOWNED_IN_RECOVERY.
* Fix to add missing page flags for Btree and GIN.
* Update copyright date.

Please take a look. Any comments are welcome. Regards, -- NAGAYASU Satoshi sn...@uptime.jp

diff --git a/Makefile b/Makefile
index c64dfbf..d80a3a3 100644
--- a/Makefile
+++ b/Makefile
@@ -1,7 +1,7 @@
 # View README.pg_filedump first
 
 # note this must match version macros in pg_filedump.h
-FD_VERSION=9.3.0
+FD_VERSION=9.5.0
 
 CC=gcc
 CFLAGS=-g -O -Wall -Wmissing-prototypes -Wmissing-declarations
diff --git a/README.pg_filedump b/README.pg_filedump
index 3a04d59..13d51ff 100644
--- a/README.pg_filedump
+++ b/README.pg_filedump
@@ -2,7 +2,7 @@
 pg_filedump - Display formatted contents of a PostgreSQL heap, index,
 or control file.
 
 Copyright (c) 2002-2010 Red Hat, Inc.
-Copyright (c) 2011-2014, PostgreSQL Global Development Group
+Copyright (c) 2011-2015, PostgreSQL Global Development Group
 
 This program is free software; you can redistribute it and/or modify
 it under the terms of the GNU General Public License as published by
diff --git a/pg_filedump.c b/pg_filedump.c
index 6cd02a9..1dfc524 100644
--- a/pg_filedump.c
+++ b/pg_filedump.c
@@ -3,7 +3,7 @@
  *	formatting heap (data), index and control files.
  *
  * Copyright (c) 2002-2010 Red Hat, Inc.
- * Copyright (c) 2011-2014, PostgreSQL Global Development Group
+ * Copyright (c) 2011-2015, PostgreSQL Global Development Group
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
@@ -24,7 +24,7 @@
 
 #include "pg_filedump.h"
 
-#include "utils/pg_crc_tables.h"
+#include "utils/pg_crc.h"
 
 /* checksum_impl.h uses Assert, which doesn't work outside the server */
 #undef Assert
@@ -91,7 +91,7 @@ DisplayOptions(unsigned int validOptions)
 		printf
 			("\nVersion %s (for %s)"
 			 "\nCopyright (c) 2002-2010 Red Hat, Inc."
-			 "\nCopyright (c) 2011-2014, PostgreSQL Global Development Group\n",
+			 "\nCopyright (c) 2011-2015, PostgreSQL Global Development Group\n",
 			 FD_VERSION, FD_PG_VERSION);
 
 	printf
@@ -1132,6 +1132,8 @@ FormatSpecial()
 			strcat(flagString, "SPLITEND|");
 		if (btreeSection->btpo_flags & BTP_HAS_GARBAGE)
 			strcat(flagString, "HASGARBAGE|");
+		if (btreeSection->btpo_flags & BTP_INCOMPLETE_SPLIT)
+			strcat(flagString, "INCOMPLETESPLIT|");
 		if (strlen(flagString))
 			flagString[strlen(flagString) - 1] = '\0';
@@ -1216,6 +1218,10 @@ FormatSpecial()
 			strcat(flagString, "LIST|");
 		if (ginSection->flags & GIN_LIST_FULLROW)
 			strcat(flagString, "FULLROW|");
+		if (ginSection->flags & GIN_INCOMPLETE_SPLIT)
+			strcat(flagString, "INCOMPLETESPLIT|");
+		if (ginSection->flags & GIN_COMPRESSED)
+			strcat(flagString, "COMPRESSED|");
 		if (strlen(flagString))
 			flagString[strlen(flagString) - 1] = '\0';
 		printf("  GIN Index Section:\n"
@@ -1340,9 +1346,9 @@ FormatControl()
 	char	   *dbState;
 
 	/* Compute a local copy of the CRC to verify the one on disk */
-	INIT_CRC32(crcLocal);
-	COMP_CRC32(crcLocal, buffer, offsetof(ControlFileData, crc));
-	FIN_CRC32(crcLocal);
+	INIT_CRC32C(crcLocal);
+	COMP_CRC32C(crcLocal, buffer, offsetof(ControlFileData, crc));
+	FIN_CRC32C(crcLocal);
 
 	/* Grab a readable version of the database state */
 	switch (controlData->state)
@@ -1353,6 +1359,9 @@ FormatControl()
 		case DB_SHUTDOWNED:
 			dbState = "SHUTDOWNED";
 			break;
+		case DB_SHUTDOWNED_IN_RECOVERY:
+			dbState = "SHUTDOWNED_IN_RECOVERY";
+			break;
 		case DB_SHUTDOWNING:
 			dbState = "SHUTDOWNING";
 			break;
@@ -1400,7 +1409,7 @@ FormatControl() Maximum Index Keys: %u\n TOAST Chunk Size: %u\n Date and Time Type Storage:
Re: [HACKERS] Assert in pg_stat_statements
On 2015/08/08 22:32, Robert Haas wrote: On Sat, Aug 8, 2015 at 5:00 AM, Satoshi Nagayasu sn...@uptime.jp wrote: I just found that pg_stat_statements causes an assert failure when queryId is set by another module, which is loaded prior to pg_stat_statements in the shared_preload_libraries parameter. Theoretically, queryId in shared memory could be set by another module by design. So, IMHO, pg_stat_statements should not assert even if queryId already has a non-zero value --- my module has this problem now. Is my understanding correct? Any comments? Seems like an ERROR would be more appropriate than an Assert. Well, it's not that simple, and I'm afraid that it may not fix the issue. Here, let's assume that we have two modules (foo and pg_stat_statements) which calculate, use and store Query->queryId independently. What we may see when the two are enabled is the following.

(1) Enable the two modules in shared_preload_libraries.

    shared_preload_libraries = 'foo,pg_stat_statements'

(2) Once a query comes in, a callback function in module foo associated with post_parse_analyze_hook calculates and stores the queryId in Query->queryId.

(3) After that, the callback function (pgss_post_parse_analyze) in pg_stat_statements associated with post_parse_analyze_hook detects that Query->queryId has a non-zero value, and the Assert fires.

As a result, we cannot have two or more modules that use queryId at the same time. But I think that should be possible, because one of the advantages of PostgreSQL is its extensibility. So, I think the problem here is that pg_stat_statements treats a non-zero value in queryId as an error even if it is exactly the same value the other module would compute. Regards, -- NAGAYASU Satoshi sn...@uptime.jp
Re: [HACKERS] Fix to expose a GUC variable, Log_disconnections, to outside of postgres.c
On 2015/07/09 15:30, Heikki Linnakangas wrote: On 07/09/2015 07:19 AM, Satoshi Nagayasu wrote: On 2015/07/09 13:06, Tom Lane wrote: Satoshi Nagayasu sn...@uptime.jp writes: I just found that Log_disconnections value has not been exposed to outside of postgres.c. Despite that, Log_connections has already been exposed. Why would an external module need to touch either one? To check settings of GUC variables from a shared preload library. For example, I'd like to print WARNING in _PG_init() when some GUC variable is disabled on postmaster startup. I still have a hard time seeing why an extension would be interested in Log_disconnections. But hey, extensions do strange things.. Definitely. :) You could use GetConfigOption(). I'm not necessarily opposed to exposing Log_disconnections, but with GetConfigOption() it will work with earlier versions too. That's it! This is what I'm looking for. Thanks a lot. :) Regards, -- NAGAYASU Satoshi sn...@uptime.jp
[HACKERS] Fix to expose a GUC variable, Log_disconnections, to outside of postgres.c
Hi, I just found that the Log_disconnections value has not been exposed to outside of postgres.c, even though Log_connections has already been exposed. So, I'd like to add the attached fix to allow other modules to touch the Log_disconnections value. Any comments? Regards, -- NAGAYASU Satoshi sn...@uptime.jp

commit afffca193b69ea5eb42f5e7273d48a6a82eb7e37
Author: Satoshi Nagayasu sn...@uptime.jp
Date: Thu Jul 9 03:42:31 2015 +

    Fix to expose Log_disconnections GUC variable to the outside of
    postgres.c.

diff --git a/src/include/postmaster/postmaster.h b/src/include/postmaster/postmaster.h
index d160304..4c76f30 100644
--- a/src/include/postmaster/postmaster.h
+++ b/src/include/postmaster/postmaster.h
@@ -25,6 +25,7 @@ extern bool ClientAuthInProgress;
 extern int	PreAuthDelay;
 extern int	AuthenticationTimeout;
 extern bool Log_connections;
+extern bool Log_disconnections;
 extern bool log_hostname;
 extern bool enable_bonjour;
 extern char *bonjour_name;
Re: [HACKERS] Fix to expose a GUC variable, Log_disconnections, to outside of postgres.c
On 2015/07/09 13:06, Tom Lane wrote: Satoshi Nagayasu sn...@uptime.jp writes: I just found that Log_disconnections value has not been exposed to outside of postgres.c. Despite that, Log_connections has already been exposed. Why would an external module need to touch either one? To check settings of GUC variables from a shared preload library. For example, I'd like to print WARNING in _PG_init() when some GUC variable is disabled on postmaster startup. Regards, -- NAGAYASU Satoshi sn...@uptime.jp
Re: [HACKERS] pg_rewind in contrib
Hi, On 2014/12/12 23:13, Heikki Linnakangas wrote: Hi, I'd like to include pg_rewind in contrib. I originally wrote it as an external project so that I could quickly get it working with the existing versions, and because I didn't feel it was quite ready for production use yet. Now, with the WAL format changes in master, it is a lot more maintainable than before. Many bugs have been fixed since the first prototypes, and I think it's fairly robust now. I propose that we include pg_rewind in contrib/ now. Attached is a patch for that. It just includes the latest sources from the current pg_rewind repository at https://github.com/vmware/pg_rewind. It is released under the PostgreSQL license. For those who are not familiar with pg_rewind, it's a tool that allows repurposing an old master server as a new standby server, after promotion, even if the old master was not shut down cleanly. That's a very often requested feature. I'm looking into pg_rewind with a very first scenario. My scenario is here. https://github.com/snaga/pg_rewind_test/blob/master/pg_rewind_test.sh At least, I think the file descriptor srcfd should be closed before exiting copy_file_range(). I got a "can't open file" error with "too many open files" while running pg_rewind.

diff --git a/contrib/pg_rewind/copy_fetch.c b/contrib/pg_rewind/copy_fetch.c
index bea1b09..5a8cc8e 100644
--- a/contrib/pg_rewind/copy_fetch.c
+++ b/contrib/pg_rewind/copy_fetch.c
@@ -280,6 +280,8 @@ copy_file_range(const char *path, off_t begin, off_t end, bool trunc)
 		write_file_range(buf, begin, readlen);
 		begin += readlen;
 	}
+
+	close(srcfd);
 }
 
 /*

And I have one question here. pg_rewind assumes that the source PostgreSQL has had, at least, one checkpoint after getting promoted. I think the target timeline id in the pg_control file to be read is only available after the first checkpoint. Right?
Regards, - Heikki -- NAGAYASU Satoshi sn...@uptime.jp
Re: [HACKERS] pg_rewind in contrib
On 2014/12/16 18:37, Heikki Linnakangas wrote: On 12/16/2014 11:23 AM, Satoshi Nagayasu wrote: pg_rewind assumes that the source PostgreSQL has, at least, one checkpoint after getting promoted. I think the target timeline id in the pg_control file to be read is only available after the first checkpoint. Right? Yes, it does assume that the source server (= old standby, new master) has had at least one checkpoint after promotion. It probably should be more explicit about it: If there hasn't been a checkpoint, you will currently get an error source and target cluster are both on the same timeline, which isn't very informative. Yes, I got the message, so I could find the checkpoint thing. It could be more informative, or some hint message could be added. I assume that by target timeline ID you meant the timeline ID of the source server, i.e. the timeline that the target server should be rewound to. Yes. Target timeline I mean here is the timeline id coming from the promoted master (== source server == old standby). I got it. Thanks. I'm going to look into more details. Regards, - Heikki -- NAGAYASU Satoshi sn...@uptime.jp
Re: [HACKERS] [PoC] pgstattuple2: block sampling to reduce physical read
(2013/10/11 7:32), Mark Kirkwood wrote: On 11/10/13 11:09, Mark Kirkwood wrote: On 16/09/13 16:20, Satoshi Nagayasu wrote: (2013/09/15 11:07), Peter Eisentraut wrote: On Sat, 2013-09-14 at 16:18 +0900, Satoshi Nagayasu wrote: I'm looking forward to seeing more feedback on this approach, in terms of design and performance improvement. So, I have submitted this for the next CF. Your patch fails to build: pgstattuple.c: In function ‘pgstat_heap_sample’: pgstattuple.c:737:13: error: ‘SnapshotNow’ undeclared (first use in this function) pgstattuple.c:737:13: note: each undeclared identifier is reported only once for each function it appears in Thanks for checking. Fixed to eliminate SnapshotNow. This seems like a cool idea! I took a quick look, and initally replicated the sort of improvement you saw: bench=# explain analyze select * from pgstattuple('pgbench_accounts'); QUERY PLAN Function Scan on pgstattuple (cost=0.00..0.01 rows=1 width=72) (actual time=786.368..786.369 rows=1 loops=1) Total runtime: 786.384 ms (2 rows) bench=# explain analyze select * from pgstattuple2('pgbench_accounts'); NOTICE: pgstattuple2: SE tuple_count 0.00, tuple_len 0.00, dead_tuple_count 0.00, dead_tuple_len 0.00, free_space 0.00 QUERY PLAN Function Scan on pgstattuple2 (cost=0.00..0.01 rows=1 width=72) (actual time=12.004..12.005 rows=1 loops=1) Total runtime: 12.019 ms (2 rows) I wondered what sort of difference eliminating caching would make: $ sudo sysctl -w vm.drop_caches=3 Repeating the above queries: bench=# explain analyze select * from pgstattuple('pgbench_accounts'); QUERY PLAN Function Scan on pgstattuple (cost=0.00..0.01 rows=1 width=72) (actual time=9503.774..9503.776 rows=1 loops=1) Total runtime: 9504.523 ms (2 rows) bench=# explain analyze select * from pgstattuple2('pgbench_accounts'); NOTICE: pgstattuple2: SE tuple_count 0.00, tuple_len 0.00, dead_tuple_count 0.00, dead_tuple_len 0.00, free_space 0.00 QUERY PLAN Function Scan on pgstattuple2 (cost=0.00..0.01 rows=1 
width=72) (actual time=12330.630..12330.631 rows=1 loops=1) Total runtime: 12331.353 ms (2 rows) So the sampling code seems *slower* when the cache is completely cold - is that expected? (I have not looked at how the code works yet - I'll dive in later if I get a chance)! Thanks for testing that. It would be very helpful to improve the performance. Quietly replying to myself - looking at the code the sampler does 3000 random page reads... I guess this is slower than 163935 (number of pages in pgbench_accounts) sequential page reads thanks to os readahead on my type of disk (WD Velociraptor). Tweaking the number of random reads (i.e. the sample size) down helps - but obviously that can impact estimation accuracy. Thinking about this a bit more, I guess the elapsed runtime is not the *only* thing to consider - the sampling code will cause way less disruption to the os page cache (3000 pages vs possibly lots more than 3000 for reading an entire relation). Thoughts? I think it could be improved by sorting the sample block numbers *before* the physical block reads, in order to eliminate random access on the disk. pseudo code:

--
for (i = 0; i < SAMPLE_SIZE; i++)
{
    sample_block[i] = random();
}

qsort(sample_block);

for (i = 0; i < SAMPLE_SIZE; i++)
{
    buf = ReadBuffer(rel, sample_block[i]);
    do_some_stats_stuff(buf);
}
--

I guess it would be helpful for reducing the random access. Any comments? -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp
Re: [HACKERS] [PATCH] Statistics collection for CLUSTER command
(2013/08/08 20:52), Vik Fearing wrote: As part of routine maintenance monitoring, it is interesting for us to have statistics on the CLUSTER command (timestamp of last run, and number of runs since stat reset) like we have for (auto)ANALYZE and (auto)VACUUM. Patch against today's HEAD attached. I would add this to the next commitfest but I seem to be unable to log in with my community account (I can log in to the wiki). Help appreciated. I have reviewed the patch. It builds with the latest HEAD and passes the regression tests. It looks good enough, and I'd like to add a test case here, not only for the view definition but also to check that it works correctly. Please take a look at the attached one. Regards, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp

diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 56bace1..ba614b2 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -28,7 +28,13 @@ SELECT pg_sleep(2.0);
 CREATE TEMP TABLE prevstats AS
 SELECT t.seq_scan, t.seq_tup_read, t.idx_scan, t.idx_tup_fetch,
        (b.heap_blks_read + b.heap_blks_hit) AS heap_blks,
-       (b.idx_blks_read + b.idx_blks_hit) AS idx_blks
+       (b.idx_blks_read + b.idx_blks_hit) AS idx_blks,
+       coalesce(t.last_vacuum, now()) AS last_vacuum,
+       coalesce(t.last_analyze, now()) AS last_analyze,
+       coalesce(t.last_cluster, now()) AS last_cluster,
+       t.vacuum_count,
+       t.analyze_count,
+       t.cluster_count
   FROM pg_catalog.pg_stat_user_tables AS t,
        pg_catalog.pg_statio_user_tables AS b
  WHERE t.relname='tenk2' AND b.relname='tenk2';
@@ -111,4 +117,27 @@ SELECT st.heap_blks_read + st.heap_blks_hit >= pr.heap_blks + cl.relpages,
  t        | t
 (1 row)
 
+-- table maintenance stats
+ANALYZE tenk2;
+VACUUM tenk2;
+CLUSTER tenk2 USING tenk2_unique1;
+SELECT pg_sleep(1.0);
+ pg_sleep 
+----------
+ 
+(1 row)
+
+SELECT st.last_vacuum > pr.last_vacuum,
+       st.last_analyze > pr.last_analyze,
+       st.last_cluster > pr.last_cluster,
+       st.vacuum_count > pr.vacuum_count,
+       st.analyze_count > pr.analyze_count,
+       st.cluster_count > pr.cluster_count
+  FROM pg_stat_user_tables AS st, prevstats pr
+ WHERE st.relname='tenk2';
+ ?column? | ?column? | ?column? | ?column? | ?column? | ?column? 
+----------+----------+----------+----------+----------+----------
+ t        | t        | t        | t        | t        | t
+(1 row)
+
 -- End of Stats Test
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index bb349b2..71e5e27 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -22,7 +22,13 @@ SELECT pg_sleep(2.0);
 CREATE TEMP TABLE prevstats AS
 SELECT t.seq_scan, t.seq_tup_read, t.idx_scan, t.idx_tup_fetch,
        (b.heap_blks_read + b.heap_blks_hit) AS heap_blks,
-       (b.idx_blks_read + b.idx_blks_hit) AS idx_blks
+       (b.idx_blks_read + b.idx_blks_hit) AS idx_blks,
+       coalesce(t.last_vacuum, now()) AS last_vacuum,
+       coalesce(t.last_analyze, now()) AS last_analyze,
+       coalesce(t.last_cluster, now()) AS last_cluster,
+       t.vacuum_count,
+       t.analyze_count,
+       t.cluster_count
   FROM pg_catalog.pg_stat_user_tables AS t,
        pg_catalog.pg_statio_user_tables AS b
  WHERE t.relname='tenk2' AND b.relname='tenk2';
@@ -81,4 +87,20 @@ SELECT st.heap_blks_read + st.heap_blks_hit >= pr.heap_blks + cl.relpages,
   FROM pg_statio_user_tables AS st, pg_class AS cl, prevstats AS pr
  WHERE st.relname='tenk2' AND cl.relname='tenk2';
 
+-- table maintenance stats
+ANALYZE tenk2;
+VACUUM tenk2;
+CLUSTER tenk2 USING tenk2_unique1;
+
+SELECT pg_sleep(1.0);
+
+SELECT st.last_vacuum > pr.last_vacuum,
+       st.last_analyze > pr.last_analyze,
+       st.last_cluster > pr.last_cluster,
+       st.vacuum_count > pr.vacuum_count,
+       st.analyze_count > pr.analyze_count,
+       st.cluster_count > pr.cluster_count
+  FROM pg_stat_user_tables AS st, prevstats pr
+ WHERE st.relname='tenk2';
+
 -- End of Stats Test
Re: [HACKERS] Re: [BUGS] BUG #7873: pg_restore --clean tries to drop tables that don't exist
(2013/07/06 1:16), Pavel Stehule wrote: I am sending a patch that removes strict requirements for DROP IF EXISTS statements. This behavior is similar to our ALTER IF EXISTS behavior now.

postgres=# DROP CAST IF EXISTS (sss AS public.casttesttype);
NOTICE: types sss and public.casttesttype does not exist, skipping
DROP CAST
postgres=# DROP FUNCTION IF EXISTS public.pt_in_widget(point, widget);
NOTICE: function public.pt_in_widget(point,widget) does not exist, skipping
DROP FUNCTION
postgres=# DROP OPERATOR IF EXISTS public.% (point, widget);
NOTICE: operator public.% does not exist, skipping
DROP OPERATOR
postgres=# DROP TRIGGER test_trigger_exists ON no_such_table;
ERROR: relation no_such_table does not exist
postgres=# DROP TRIGGER IF EXISTS test_trigger_exists ON no_such_table;
NOTICE: trigger test_trigger_exists for table no_such_table does not exist, skipping
DROP TRIGGER

This functionality is necessary for a correct, quiet reload from a dump without possible warnings. I'm looking at this patch, and I have a question here. Should DROP TRIGGER IF EXISTS ignore the error for both a non-existing trigger and a non-existing table? Or only for a non-existing trigger? I've not understood the pg_restore issue precisely so far, but IMHO DROP TRIGGER IF EXISTS means if the _trigger_ exists, not if the _table_ exists. Is this a correct and/or an expected behavior? Sorry if I missed some consensus which we already made. Any comments? Regards, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp
Re: [HACKERS] [PoC] pgstattuple2: block sampling to reduce physical read
(2013/09/15 11:07), Peter Eisentraut wrote: On Sat, 2013-09-14 at 16:18 +0900, Satoshi Nagayasu wrote: I'm looking forward to seeing more feedback on this approach, in terms of design and performance improvement. So, I have submitted this for the next CF. Your patch fails to build: pgstattuple.c: In function ‘pgstat_heap_sample’: pgstattuple.c:737:13: error: ‘SnapshotNow’ undeclared (first use in this function) pgstattuple.c:737:13: note: each undeclared identifier is reported only once for each function it appears in Thanks for checking. Fixed to eliminate SnapshotNow. Regards, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp diff --git a/contrib/pgstattuple/pgstattuple--1.1--1.2.sql b/contrib/pgstattuple/pgstattuple--1.1--1.2.sql index 2783a63..8ebec6f 100644 --- a/contrib/pgstattuple/pgstattuple--1.1--1.2.sql +++ b/contrib/pgstattuple/pgstattuple--1.1--1.2.sql @@ -37,3 +37,17 @@ CREATE FUNCTION pg_relpages(IN relname regclass) RETURNS BIGINT AS 'MODULE_PATHNAME', 'pg_relpagesbyid' LANGUAGE C STRICT; + +CREATE FUNCTION pgstattuple2(IN relname regclass, +OUT table_len BIGINT, -- physical table length in bytes +OUT tuple_count BIGINT,-- number of live tuples +OUT tuple_len BIGINT, -- total tuples length in bytes +OUT tuple_percent FLOAT8, -- live tuples in % +OUT dead_tuple_count BIGINT, -- number of dead tuples +OUT dead_tuple_len BIGINT, -- total dead tuples length in bytes +OUT dead_tuple_percent FLOAT8, -- dead tuples in % +OUT free_space BIGINT, -- free space in bytes +OUT free_percent FLOAT8) -- free space in % +AS 'MODULE_PATHNAME', 'pgstattuple2' +LANGUAGE C STRICT; + diff --git a/contrib/pgstattuple/pgstattuple--1.2.sql b/contrib/pgstattuple/pgstattuple--1.2.sql index e5fa2f5..963eb00 100644 --- a/contrib/pgstattuple/pgstattuple--1.2.sql +++ b/contrib/pgstattuple/pgstattuple--1.2.sql @@ -77,3 +77,17 @@ CREATE FUNCTION pg_relpages(IN relname regclass) RETURNS BIGINT AS 'MODULE_PATHNAME', 'pg_relpagesbyid' LANGUAGE C STRICT; 
+ +CREATE FUNCTION pgstattuple2(IN relname regclass, +OUT table_len BIGINT, -- physical table length in bytes +OUT tuple_count BIGINT,-- number of live tuples +OUT tuple_len BIGINT, -- total tuples length in bytes +OUT tuple_percent FLOAT8, -- live tuples in % +OUT dead_tuple_count BIGINT, -- number of dead tuples +OUT dead_tuple_len BIGINT, -- total dead tuples length in bytes +OUT dead_tuple_percent FLOAT8, -- dead tuples in % +OUT free_space BIGINT, -- free space in bytes +OUT free_percent FLOAT8) -- free space in % +AS 'MODULE_PATHNAME', 'pgstattuple2' +LANGUAGE C STRICT; + diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c index f9ba0a6..2b29d44 100644 --- a/contrib/pgstattuple/pgstattuple.c +++ b/contrib/pgstattuple/pgstattuple.c @@ -41,9 +41,22 @@ PG_MODULE_MAGIC; PG_FUNCTION_INFO_V1(pgstattuple); PG_FUNCTION_INFO_V1(pgstattuplebyid); +PG_FUNCTION_INFO_V1(pgstattuple2); extern Datum pgstattuple(PG_FUNCTION_ARGS); extern Datum pgstattuplebyid(PG_FUNCTION_ARGS); +extern Datum pgstattuple2(PG_FUNCTION_ARGS); + +#define SAMPLE_SIZE 3000 + +typedef struct pgstattuple_block_stats +{ + uint64 tuple_count; + uint64 tuple_len; + uint64 dead_tuple_count; + uint64 dead_tuple_len; + uint64 free_space; /* free/reusable space in bytes */ +} pgstattuple_block_stats; /* * struct pgstattuple_type @@ -66,8 +79,9 @@ typedef void (*pgstat_page) (pgstattuple_type *, Relation, BlockNumber, static Datum build_pgstattuple_type(pgstattuple_type *stat, FunctionCallInfo fcinfo); -static Datum pgstat_relation(Relation rel, FunctionCallInfo fcinfo); +static Datum pgstat_relation(Relation rel, FunctionCallInfo fcinfo, bool enable_sample); static Datum pgstat_heap(Relation rel, FunctionCallInfo fcinfo); +static Datum pgstat_heap_sample(Relation rel, FunctionCallInfo fcinfo); static void pgstat_btree_page(pgstattuple_type *stat, Relation rel, BlockNumber blkno, BufferAccessStrategy bstrategy); @@ -81,6 +95,11 @@ static Datum pgstat_index(Relation rel, 
BlockNumber start, pgstat_page pagefn, FunctionCallInfo fcinfo); static void pgstat_index_page(pgstattuple_type *stat, Page page, OffsetNumber minoff, OffsetNumber maxoff); +static void compute_parameters(pgstattuple_block_stats *block_stats, + BlockNumber sample_size, BlockNumber nblocks, + uint64 *tuple_count, uint64 *tuple_len, + uint64 *dead_tuple_count
Re: [HACKERS] pg_system_identifier()
Hi, I'm catching up with the discussion as a reviewer... (2013/08/27 5:35), Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: On 2013-08-26 10:10:54 -0700, Josh Berkus wrote: I'm going to reverse my vote, and vote against this patch. The reason why is that I think we should instead have a function: pg_controldata(parameter text) ... which would report *all* strings in pg_controldata. Hence, you'd do instead: pg_controldata('system identifier') This will hopefully spare us from 15 patches incrementally adding all of the individual items in controldata. If anything but the proposed feature, it should be an SRF - passing in text parameters isn't very discoverable. I'm not pleased with the idea that we'd have to dumb all the relevant datatypes down to text so that we could push them through this single function. Also, what about i18n? pg_controldata localizes all the strings it prints, but I doubt that's a great idea for either the input or the output of this proposed function. How about adding a new system view, backed by a new function which returns a single pg_controldata value as text, and using a cast for each column in the view definition?

CREATE VIEW pg_catalog.pg_controldata AS
  SELECT pg_controldata('control_version')::integer AS control_version,
         pg_controldata('catalog_version')::integer AS catalog_version,
         pg_controldata('system_identifier')::bigint AS system_identifier,
         ...
         pg_controldata('next_xlog_file')::char(25) AS next_xlog_file,
         ...
         pg_controldata('encoding')::text AS encoding;

Given that the view can work like an SRF, it allows us to retrieve all the values of pg_controldata, with appropriate types, in a single record from the view:

select * from pg_catalog.pg_controldata;

To get the system identifier value:

select system_identifier from pg_catalog.pg_controldata;

Regards, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC.
http://www.uptime.jp
Re: [HACKERS] [PoC] pgstattuple2: block sampling to reduce physical read
(2013/07/23 20:02), Greg Smith wrote: On 7/23/13 2:16 AM, Satoshi Nagayasu wrote: I've been working on a new pgstattuple function to allow block sampling [1] in order to reduce block reads while scanning a table. A PoC patch is attached. Take a look at all of the messages linked in https://commitfest.postgresql.org/action/patch_view?id=778 Jaime and I tried to do what you're working on then, including a random block sampling mechanism modeled on the stats_target mechanism. We didn't do that as part of pgstattuple though, which was a mistake. Noah created some test cases as part of his thorough review that were not computing the correct results. Getting the results correct for all of the various types of PostgreSQL tables and indexes ended up being much harder than the sampling part. See http://www.postgresql.org/message-id/20120222052747.ge8...@tornado.leadboat.com in particular for that. Thanks for the info. I have read the previous discussion. I'm looking forward to seeing more feedback on this approach, in terms of design and performance improvement. So, I have submitted this for the next CF. This new function, pgstattuple2(), samples only 3,000 blocks (which accounts for 24MB) from the table randomly, and estimates several parameters of the entire table. There should be an input parameter to the function for how much sampling to do, and if it's possible to make the scale for it look like ANALYZE that's helpful too. I have a project for this summer that includes reviving this topic and making sure it works on some real-world systems. If you want to work on this too, I can easily combine that project into what you're doing. Yeah, I'm interested in that. Is there something that can be shared? Regards, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp
Re: [HACKERS] New statistics for WAL buffer dirty writes
Thanks for checking. Revised one attached. (2013/09/10 6:43), Peter Eisentraut wrote: On 9/6/13 11:32 PM, Satoshi Nagayasu wrote: The revised patch for wal buffer statistics is attached. A test script is also attached. Please take a look. You have duplicate OIDs. Run the script duplicate_oids to find them. -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 23ebc11..cdced7f 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -1878,6 +1878,13 @@ include 'filename' results in most cases. /para + para +When you see pg_stat_walwriter.dirty_write, which means number +of buffer flushing at buffer full, is continuously increasing +in your running server, you may need to enlarge this buffer +size. + /para + /listitem /varlistentry diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml index 4ec6981..15d9202 100644 --- a/doc/src/sgml/monitoring.sgml +++ b/doc/src/sgml/monitoring.sgml @@ -278,6 +278,14 @@ postgres: replaceableuser/ replaceabledatabase/ replaceablehost/ re /row row + entrystructnamepg_stat_walwriter/indextermprimarypg_stat_walwriter/primary/indexterm/entry + entryOne row only, showing statistics about the wal writer + process's activity. See xref linkend=pg-stat-walwriter-view + for details. + /entry + /row + + row entrystructnamepg_stat_database/indextermprimarypg_stat_database/primary/indexterm/entry entryOne row per database, showing database-wide statistics. See xref linkend=pg-stat-database-view for details. @@ -735,6 +743,39 @@ postgres: replaceableuser/ replaceabledatabase/ replaceablehost/ re single row, containing global data for the cluster. 
Re: [HACKERS] New statistics for WAL buffer dirty writes
(2013/09/10 22:48), Peter Eisentraut wrote:
> On 9/10/13 3:37 AM, Satoshi Nagayasu wrote:
>> Thanks for checking. Revised one attached.
>
> Please fix compiler warning:
>
> walwriter.c: In function ‘WalWriterMain’:
> walwriter.c:293:3: warning: implicit declaration of function
> ‘pgstat_send_walwriter’ [-Wimplicit-function-declaration]

Thanks. Fixed.

--
Satoshi Nagayasu <sn...@uptime.jp>
Uptime Technologies, LLC. http://www.uptime.jp

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 23ebc11..cdced7f 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1878,6 +1878,13 @@ include 'filename'
         results in most cases.
        </para>
+      <para>
+       When pg_stat_walwriter.dirty_writes, the number of WAL buffer
+       flushes forced by the buffer being full, is continuously
+       increasing on your running server, you may need to enlarge
+       this buffer size.
+      </para>
+
      </listitem>
     </varlistentry>

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 4ec6981..15d9202 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -278,6 +278,14 @@ postgres: <replaceable>user</> <replaceable>database</> <replaceable>host</> <re
      </row>

      <row>
+      <entry><structname>pg_stat_walwriter</><indexterm><primary>pg_stat_walwriter</primary></indexterm></entry>
+      <entry>One row only, showing statistics about the wal writer
+       process's activity. See <xref linkend="pg-stat-walwriter-view">
+       for details.
+      </entry>
+     </row>
+
+     <row>
       <entry><structname>pg_stat_database</><indexterm><primary>pg_stat_database</primary></indexterm></entry>
       <entry>One row per database, showing database-wide statistics. See
        <xref linkend="pg-stat-database-view"> for details.
@@ -735,6 +743,39 @@ postgres: <replaceable>user</> <replaceable>database</> <replaceable>host</> <re
    single row, containing global data for the cluster.
   </para>
+  <table id="pg-stat-walwriter-view" xreflabel="pg_stat_walwriter">
+   <title><structname>pg_stat_walwriter</structname> View</title>
+
+   <tgroup cols="3">
+    <thead>
+     <row>
+      <entry>Column</entry>
+      <entry>Type</entry>
+      <entry>Description</entry>
+     </row>
+    </thead>
+
+    <tbody>
+     <row>
+      <entry><structfield>dirty_writes</></entry>
+      <entry><type>bigint</type></entry>
+      <entry>Number of dirty writes, i.e. WAL buffer flushes forced
+       because the buffers were full.</entry>
+     </row>
+     <row>
+      <entry><structfield>stats_reset</></entry>
+      <entry><type>timestamp with time zone</type></entry>
+      <entry>Time at which these statistics were last reset</entry>
+     </row>
+    </tbody>
+   </tgroup>
+  </table>
+
+  <para>
+   The <structname>pg_stat_walwriter</structname> view will always have a
+   single row, containing global data for the cluster.
+  </para>
+
   <table id="pg-stat-database-view" xreflabel="pg_stat_database">
    <title><structname>pg_stat_database</structname> View</title>
    <tgroup cols="3">

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index dc47c47..d0e85c9 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2467,6 +2467,7 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
 				WriteRqst.Write = OldPageRqstPtr;
 				WriteRqst.Flush = 0;
 				XLogWrite(WriteRqst, false);
+				WalWriterStats.m_xlog_dirty_writes++;
 				LWLockRelease(WALWriteLock);
 				TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
 			}

diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 575a40f..12a2ed0 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -686,6 +686,11 @@ CREATE VIEW pg_stat_bgwriter AS
         pg_stat_get_buf_alloc() AS buffers_alloc,
         pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;

+CREATE VIEW pg_stat_walwriter AS
+    SELECT
+        pg_stat_get_xlog_dirty_writes() AS dirty_writes,
+        pg_stat_get_wal_stat_reset_time() AS stats_reset;
+
 CREATE VIEW pg_user_mappings AS
     SELECT
         U.oid       AS umid,

diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index b5ce2f6..8c56af5 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -129,6 +129,14 @@ char	   *pgstat_stat_tmpname = NULL;
  */
 PgStat_MsgBgWriter BgWriterStats;

+/*
+ * WalWriter statistics counter.
+ * This counter is incremented by each XLogWrite call,
+ * both in the wal writer process and each backend.
+ * And then, sent to the stat collector process.
+ */
+PgStat_MsgWalWriter WalWriterStats;
+
 /* --
  * Local data
  * --
@@ -293,6 +301,7 @@ static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
 static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg
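The counter the patch adds is bumped at exactly one place: when AdvanceXLInsertBuffer() finds no clean WAL buffer and has to call XLogWrite() itself. A minimal sketch of that condition, using a hypothetical Python model of a fixed ring of WAL pages (all names here are illustrative; the real logic is C inside xlog.c):

```python
class WalBuffers:
    """Illustrative model: a fixed ring of WAL pages. When a record
    needs a fresh page and the oldest page has not been written out
    yet, it must be flushed first -- that forced flush is one
    'dirty write'."""

    def __init__(self, n_pages):
        self.n_pages = n_pages
        self.unwritten = 0        # pages filled but not yet written out
        self.dirty_writes = 0     # counter shown in pg_stat_walwriter

    def advance_page(self):
        """Claim the next WAL page for insertion."""
        if self.unwritten == self.n_pages:
            # Buffer full: write the oldest page before reusing it.
            self.unwritten -= 1
            self.dirty_writes += 1   # WalWriterStats.m_xlog_dirty_writes++
        self.unwritten += 1

    def background_write(self):
        """The walwriter draining the buffers in time: no dirty write."""
        self.unwritten = 0


buf = WalBuffers(n_pages=8)
for _ in range(20):          # 20 pages of WAL, no background writes
    buf.advance_page()
print(buf.dirty_writes)      # 12 forced flushes: 20 pages - 8 buffers
```

When the walwriter keeps up (background_write() runs often enough), dirty_writes stays at zero; that is why a steadily increasing counter is the signal that wal_buffers may be too small.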
Re: [HACKERS] [rfc] overhauling pgstat.stat
(2013/09/09 8:19), Tomas Vondra wrote:

On 8.9.2013 23:04, Jeff Janes wrote:

On Tue, Sep 3, 2013 at 10:09 PM, Satoshi Nagayasu sn...@uptime.jp wrote:

Hi,

(2013/09/04 13:07), Alvaro Herrera wrote:

Satoshi Nagayasu wrote:

As you may know, this file could be hundreds of MB in size, because
pgstat.stat holds all access statistics in each database, and it needs
to read/write the entire pgstat.stat frequently. As a result,
pgstat.stat often generates massive I/O operations, particularly when
the database has a large number of tables.

We already changed it:

commit 187492b6c2e8cafc5b39063ca3b67846e8155d24
Author: Alvaro Herrera alvhe...@alvh.no-ip.org
Date: Mon Feb 18 17:56:08 2013 -0300

    Split pgstat file in smaller pieces

Thanks for the comments. I forgot to mention that.

Yes, we have already split the single pgstat.stat file into several
pieces. However, we still need to read/write a large amount of
statistics data when we have a large number of tables in a single
database, or multiple databases being accessed. Right?

Do you have a test case for measuring this? I vaguely remember from
when I was testing the split patch, that I thought that after that
improvement the load that was left was so low that there was little
point in optimizing it further.

This is actually a pretty good point. Creating a synthetic test case is
quite simple - just create 1,000,000 tables in a single database - but
I'm wondering if it's actually realistic. Do we have a real-world
example where the current one stat file per database is not enough?

I have several assumptions for that.

- A single shared database contains thousands of customers.
- Each customer has hundreds of tables and indexes.
- Customers are separated by schemas (namespaces) in a single database.
- The application server uses connection pooling for performance reasons.
- The workload (locality of table access) cannot be predicted.

Looks reasonable?

The reason why I worked on the split patch is that our application is
slightly crazy and creates a lot of tables (+ indexes) on the fly, and
as we have up to a thousand databases on each host, we often ended up
with a huge stat file. Splitting the stat file improved that
considerably, although that's partially because we have the stats on a
tmpfs, so I/O is not a problem, and the CPU overhead is negligible
thanks to splitting the stats per database.

I agree that splitting a single large database into several pieces,
like thousands of tiny databases, could be an option in some cases.
However, what I intend here is eliminating those limitations on
database design.

In fact, when considering connection pooling, splitting a database is
not a good idea, because AFAIK many connection poolers manage
connections per database.

So, I'd like to support 100k tables in a single database. Any comments?

Regards,

But AFAIK there are operating systems where creating a filesystem in
RAM is not that simple - e.g. Windows. In such cases even a moderate
number of objects may be a significant issue I/O-wise.

But then again, I can't really think of a reasonable system creating
that many objects in a single database (except for e.g. a shared
database using schemas instead of databases).

Tomas

--
Satoshi Nagayasu <sn...@uptime.jp>
Uptime Technologies, LLC. http://www.uptime.jp

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
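The "hundreds of MB" figure in this thread is easy to reproduce as a back-of-envelope estimate: the stats file carries one fixed-size record per table and index, and the collector rewrites it wholesale. The ~200-byte entry size below is an illustrative assumption, not the real sizeof(PgStat_StatTabEntry):

```python
# Back-of-envelope: per-database stats file size per rewrite.
# ENTRY_BYTES is an illustrative assumption, not a measured value.
ENTRY_BYTES = 200

def stat_file_mb(n_tables, n_indexes_per_table=2):
    """Rough stats file size (MB) for a database with n_tables,
    assuming each table carries n_indexes_per_table index entries."""
    n_entries = n_tables * (1 + n_indexes_per_table)
    return n_entries * ENTRY_BYTES / (1024 * 1024)

for n in (10_000, 100_000, 1_000_000):
    print(f"{n:>9} tables -> ~{stat_file_mb(n):7.1f} MB per rewrite")
```

Under these assumptions a million-table database alone accounts for over half a gigabyte per rewrite, which is in the range the thread complains about.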
Re: [HACKERS] New statistics for WAL buffer dirty writes
Hi,

The revised patch for the WAL buffer statistics is attached. A test
script is also attached. Please take a look.

Regards,

(2013/07/19 7:49), Satoshi Nagayasu wrote:

Will revise and re-submit for the next CF.

Regards,

2013/07/19 1:06, Alvaro Herrera wrote:

What happened to this patch? We were waiting on an updated version
from you.

Satoshi Nagayasu wrote:

(2012/12/10 3:06), Tomas Vondra wrote:

On 29.10.2012 04:58, Satoshi Nagayasu wrote:

2012/10/24 1:12, Alvaro Herrera wrote:

Satoshi Nagayasu escribió:

With this patch, the walwriter process and each backend process sum up
dirty writes and send them to the stats collector. So, the value can be
saved in the stats file, and can be kept across restarts. The
statistics can be retrieved using the pg_stat_get_xlog_dirty_writes()
function, and can be reset by calling pg_stat_reset_shared('walwriter').

Now, I have one concern. The reset time can be captured in
globalStats.stat_reset_timestamp, but this value is shared with the
bgwriter one. So, once pg_stat_reset_shared('walwriter') is called, the
stats_reset column in pg_stat_bgwriter represents the reset time for
the walwriter, not for the bgwriter. How should we handle this? Should
we split this value? And should we have a new system view for the
walwriter?

I think the answer to the two last questions is yes. It doesn't seem to
make sense, to me, to have a single reset timing for what are
effectively two separate things.

Please submit an updated patch to the next CF. I'm marking this one
returned with feedback.

Thanks. I attached the latest one, which splits the reset_time for the
bgwriter and the walwriter, and provides a new system view, called
pg_stat_walwriter, to show the dirty write counter and the reset time.

I've done a quick review of the v4 patch:

Thanks for the review, and sorry for my delayed response.

1) applies fine on HEAD, compiles fine

2) make installcheck fails because of a difference in the 'rules' test
suite (there's a new view pg_stat_walwriter - see the attached patch
for a fixed version of expected/rules.out)

Ah, I forgot about the regression test. I will fix it. Thanks.

3) I do agree with Alvaro that using the same struct for two separate
components (bgwriter and walwriter) seems a bit awkward. For example
you need to have two separate stat_reset fields, the reset code becomes
much more verbose (because you need to list individual fields) etc. So
I'd vote to either split this into two structures or keep it as a
single structure (although with two views on top of it).

Ok, I will split it into two structs, PgStat_BgWriterGlobalStats and
PgStat_WalWriterGlobalStats, and will modify PgStat_GlobalStats to hold
those two structs in the stats collector.

4) Are there any other fields that might be interesting? Right now
there's just dirty_writes, but I guess there are other values. E.g. how
much data was actually written etc.?

AFAIK, those numbers can be obtained by calling
pg_current_xlog_insert_location() or pg_current_xlog_location(), but if
we need them, I will add them.

Regards,

--
Satoshi Nagayasu <sn...@uptime.jp>
Uptime Technologies, LLC. http://www.uptime.jp

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 23ebc11..cdced7f 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1878,6 +1878,13 @@ include 'filename'
         results in most cases.
        </para>
+      <para>
+       When pg_stat_walwriter.dirty_writes, the number of WAL buffer
+       flushes forced by the buffer being full, is continuously
+       increasing on your running server, you may need to enlarge
+       this buffer size.
+      </para>
+
      </listitem>
     </varlistentry>

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 4ec6981..15d9202 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -278,6 +278,14 @@ postgres: <replaceable>user</> <replaceable>database</> <replaceable>host</> <re
      </row>

      <row>
+      <entry><structname>pg_stat_walwriter</><indexterm><primary>pg_stat_walwriter</primary></indexterm></entry>
+      <entry>One row only, showing statistics about the wal writer
+       process's activity. See <xref linkend="pg-stat-walwriter-view">
+       for details.
+      </entry>
+     </row>
+
+     <row>
       <entry><structname>pg_stat_database</><indexterm><primary>pg_stat_database</primary></indexterm></entry>
       <entry>One row per database, showing database-wide statistics. See
        <xref linkend="pg-stat-database-view"> for details.
@@ -735,6 +743,39 @@ postgres: <replaceable>user</> <replaceable>database</> <replaceable>host</> <re
    single row, containing global data for the cluster.
   </para>
+  <table id="pg-stat-walwriter-view" xreflabel="pg_stat_walwriter">
+   <title><structname>pg_stat_walwriter</structname> View</title>
+
+   <tgroup cols="3">
+    <thead>
+     <row>
+      <entry>Column</entry>
+      <entry>Type</entry>
+      <entry>Description</entry>
+     </row>
+    </thead>
+
+    <tbody>
Re: [HACKERS] [rfc] overhauling pgstat.stat
(2013/09/04 15:23), Atri Sharma wrote:

Sent from my iPad

On 04-Sep-2013, at 10:54, Satoshi Nagayasu sn...@uptime.jp wrote:

Hi,

(2013/09/04 12:52), Atri Sharma wrote:

On Wed, Sep 4, 2013 at 6:40 AM, Satoshi Nagayasu sn...@uptime.jp wrote:

Hi,

I'm considering overhauling pgstat.stat, and would like to know how
many people are interested in this topic.

As you may know, this file could be hundreds of MB in size, because
pgstat.stat holds all access statistics in each database, and it needs
to read/write the entire pgstat.stat frequently. As a result,
pgstat.stat often generates massive I/O operations, particularly when
the database has a large number of tables.

To support multi-tenancy, or just a large number of tables (up to 10k
tables in a single database), I think pgstat.stat needs to be
overhauled. I think using a heap and a btree in pgstat.stat would be
preferable, to reduce read/write volume and to allow updating access
statistics for specific tables in the pgstat.stat file.

Is this good for us?

Hi,

Nice thought. I/O reduction in pgstat can be really helpful.

I am trying to think of our aim here. Would we be looking to split
pgstat per table, so that the I/O write happens for only a portion of
pgstat? Or reduce the I/O in general?

I prefer the latter.

Under the current implementation, DBAs need to split a single database
into many smaller databases, taking the access locality of the tables
into account. That's difficult, and it could change in the future. And
splitting the statistics data into many files (per table, for example)
would cause another performance issue when collecting/showing
statistics at once. Just my guess though.

So, I'm looking for a new way to reduce I/O for the statistics data in
general.

Regards,

If the latter, how would using a BTree help us? I would rather go for a
range tree or something. But again, I may be completely wrong. Please
elaborate a bit more on the solution we are trying to achieve. It seems
really interesting.

Regards,

Atri

Right, thanks.

How would using a heap and a BTree help here? Are we looking at a
priority queue which supports the main storage system of the stats?

For example, when you read only a single block from your table, you
then need to write all the values in your database statistics. That
often generates a large amount of I/O. However, if random access is
allowed in the statistics, you can update only a single record for the
specific table which you read. It would be less than 100 bytes for
each table.

I have no idea how a priority queue could work here so far. However, if
the statistics are overhauled, PostgreSQL would be able to host a much
larger number of customers more easily.

Regards,

--
Satoshi Nagayasu <sn...@uptime.jp>
Uptime Technologies, LLC. http://www.uptime.jp
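The write amplification described above can be put in rough numbers: today, one counter change eventually rewrites the whole per-database file, whereas in-place random access would touch a single record. Both figures below are illustrative assumptions, not measurements:

```python
# Illustrative write-amplification estimate for one stats update.
ENTRY_BYTES = 100          # assumed size of one per-table stats record
N_TABLES = 100_000         # tables in the database (illustrative)

whole_file_rewrite = N_TABLES * ENTRY_BYTES   # bytes written today
in_place_update = ENTRY_BYTES                 # bytes with random access

amplification = whole_file_rewrite // in_place_update
print(f"rewrite: {whole_file_rewrite / 1024 / 1024:.1f} MB, "
      f"in-place: {in_place_update} B, "
      f"amplification: {amplification:,}x")
```

Even if the real entry sizes differ, the ratio scales linearly with the number of tables, which is why the problem only bites on very large schemas.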
Re: [HACKERS] [rfc] overhauling pgstat.stat
(2013/09/05 3:50), Pavel Stehule wrote:

we very successfully use a tmpfs volume for pgstat files (we use a
backport of the multiple-statfiles patch from 9.3 to 9.1).

It works quite well as long as you have the objects (tables, indexes,
functions) spread across multiple databases. Once you have one database
with a very large number of objects, tmpfs is not as effective. It's
going to help with stats I/O, but it's not going to help with high CPU
usage (you're reading and parsing the stat files over and over), and
every rewrite creates a copy of the file. So if you have 400MB stats,
you will need 800MB tmpfs + some slack (say, 200MB). That means you'll
use ~1GB tmpfs although 400MB would be just fine. And this 600MB won't
be used for page cache etc. OTOH, it's true that if you have that many
objects, 600MB of RAM is not going to help you anyway.

and just an idea - can we use a database for storing these files? It
could use unlogged tables. A second idea - keep one bg worker as a
persistent in-memory key-value database, holding the data in memory
with some optimizations - using anti-caching and similar in-memory
database features.

Yeah, I'm interested in this idea too.

If the stat collector has a dedicated connection to the backend in
order to store statistics into dedicated tables, we can easily take
advantage of index (btree, or hash?) and heap storage.

Is this worth trying?

Regards,

--
Satoshi Nagayasu <sn...@uptime.jp>
Uptime Technologies, LLC. http://www.uptime.jp
Re: [HACKERS] [rfc] overhauling pgstat.stat
(2013/09/05 3:59), Alvaro Herrera wrote:

Tomas Vondra wrote:

My idea was to keep the per-database stats, but allow some sort of
random access - updating / deleting the records in place, adding
records etc. The simplest way I could think of was adding a simple
index - a mapping of OID to position in the stat file, i.e. a simple
array of (oid, offset) pairs, stored in oid.stat.index or something
like that.

This would make it quite simple to access an existing record:

1: get the position from the index
2: read sizeof(Entry) from the file
3: if it's an update, just overwrite the bytes; for a delete, set an
   isdeleted flag (needs to be added to all entries)

or to read all the records (just read the whole file as today).

Sounds reasonable. However, I think the index should be a real index,
i.e. have a tree structure that can be walked down, not just a plain
array. If you have a 400 MB stat file, then you must have about 4
million tables, and you will not want to scan such a large array every
time you want to find an entry.

I thought of an array structure at first. But, for now, I think we
should have a real index for the statistics data, because we already
have several index storages, and it will allow us to minimize
read/write operations.

BTW, what kind of index would be preferred for this purpose? btree or
hash? If we use btree, do we need range scans on the statistics tables?
I have no idea so far.

Regards,

--
Satoshi Nagayasu <sn...@uptime.jp>
Uptime Technologies, LLC. http://www.uptime.jp
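Tomas's (oid, offset) scheme with an isdeleted flag can be sketched compactly. A plain dict stands in for the on-disk index (which, per Alvaro, should really be a tree), and the record layout below is invented for illustration - it is not the real PgStat_StatTabEntry:

```python
import struct

# oid, n_scans, n_tuples, isdeleted flag -- an illustrative layout only.
ENTRY = struct.Struct("<IQQB")

class StatFile:
    """In-place updatable stats 'file' with an oid -> offset index."""

    def __init__(self):
        self.data = bytearray()              # stand-in for the file
        self.index = {}                      # oid -> byte offset

    def upsert(self, oid, n_scans, n_tuples):
        rec = ENTRY.pack(oid, n_scans, n_tuples, 0)
        off = self.index.get(oid)
        if off is None:                      # new entry: append + index it
            self.index[oid] = len(self.data)
            self.data += rec
        else:                                # existing: overwrite in place
            self.data[off:off + ENTRY.size] = rec

    def delete(self, oid):
        off = self.index.pop(oid)            # set isdeleted, keep the bytes
        self.data[off + ENTRY.size - 1] = 1

    def get(self, oid):
        off = self.index[oid]
        _, scans, tuples, _ = ENTRY.unpack_from(self.data, off)
        return scans, tuples


f = StatFile()
for oid in range(1000):
    f.upsert(oid, 0, 0)
size_before = len(f.data)
f.upsert(42, 7, 123)      # touches ENTRY.size bytes, not the whole file
assert len(f.data) == size_before
print(ENTRY.size, f.get(42))
```

The point of the layout is the upsert path: an existing entry is overwritten at its known offset, so updating one table's counters costs ENTRY.size bytes instead of a whole-file rewrite.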
[HACKERS] [rfc] overhauling pgstat.stat
Hi,

I'm considering overhauling pgstat.stat, and would like to know how
many people are interested in this topic.

As you may know, this file could be hundreds of MB in size, because
pgstat.stat holds all access statistics in each database, and it needs
to read/write the entire pgstat.stat frequently. As a result,
pgstat.stat often generates massive I/O operations, particularly when
the database has a large number of tables.

To support multi-tenancy, or just a large number of tables (up to 10k
tables in a single database), I think pgstat.stat needs to be
overhauled. I think using a heap and a btree in pgstat.stat would be
preferable, to reduce read/write volume and to allow updating access
statistics for specific tables in the pgstat.stat file.

Is this good for us? Any comments or suggestions?

Regards,

--
Satoshi Nagayasu <sn...@uptime.jp>
Uptime Technologies, LLC. http://www.uptime.jp
Re: [HACKERS] [rfc] overhauling pgstat.stat
Hi,

(2013/09/04 13:07), Alvaro Herrera wrote:

Satoshi Nagayasu wrote:

As you may know, this file could be hundreds of MB in size, because
pgstat.stat holds all access statistics in each database, and it needs
to read/write the entire pgstat.stat frequently. As a result,
pgstat.stat often generates massive I/O operations, particularly when
the database has a large number of tables.

We already changed it:

commit 187492b6c2e8cafc5b39063ca3b67846e8155d24
Author: Alvaro Herrera alvhe...@alvh.no-ip.org
Date: Mon Feb 18 17:56:08 2013 -0300

    Split pgstat file in smaller pieces

Thanks for the comments. I forgot to mention that.

Yes, we have already split the single pgstat.stat file into several
pieces. However, we still need to read/write a large amount of
statistics data when we have a large number of tables in a single
database, or multiple databases being accessed. Right?

I think the issue here is that it is necessary to write/read statistics
data even if it's not actually changed. So, I'm wondering how we can
minimize read/write operations on these statistics data files using a
heap and a btree.

Regards,

commit 187492b6c2e8cafc5b39063ca3b67846e8155d24
Author: Alvaro Herrera alvhe...@alvh.no-ip.org
Date: Mon Feb 18 17:56:08 2013 -0300

    Split pgstat file in smaller pieces

    We now write one file per database and one global file, instead of
    having the whole thing in a single huge file. This reduces the I/O
    that must be done when partial data is required -- which is all the
    time, because each process only needs information on its own
    database anyway. Also, the autovacuum launcher does not need data
    about tables and functions in each database; having the global
    stats for all DBs is enough.

    Catalog version bumped because we have a new subdir under PGDATA.

    Author: Tomas Vondra. Some rework by Álvaro
    Testing by Jeff Janes
    Other discussion by Heikki Linnakangas, Tom Lane.

I don't oppose further tweaking, of course, but I wonder if you are
considering these changes.

--
Satoshi Nagayasu <sn...@uptime.jp>
Uptime Technologies, LLC. http://www.uptime.jp
Re: [HACKERS] [rfc] overhauling pgstat.stat
Hi,

(2013/09/04 12:52), Atri Sharma wrote:

On Wed, Sep 4, 2013 at 6:40 AM, Satoshi Nagayasu sn...@uptime.jp wrote:

Hi,

I'm considering overhauling pgstat.stat, and would like to know how
many people are interested in this topic.

As you may know, this file could be hundreds of MB in size, because
pgstat.stat holds all access statistics in each database, and it needs
to read/write the entire pgstat.stat frequently. As a result,
pgstat.stat often generates massive I/O operations, particularly when
the database has a large number of tables.

To support multi-tenancy, or just a large number of tables (up to 10k
tables in a single database), I think pgstat.stat needs to be
overhauled. I think using a heap and a btree in pgstat.stat would be
preferable, to reduce read/write volume and to allow updating access
statistics for specific tables in the pgstat.stat file.

Is this good for us?

Hi,

Nice thought. I/O reduction in pgstat can be really helpful.

I am trying to think of our aim here. Would we be looking to split
pgstat per table, so that the I/O write happens for only a portion of
pgstat? Or reduce the I/O in general?

I prefer the latter.

Under the current implementation, DBAs need to split a single database
into many smaller databases, taking the access locality of the tables
into account. That's difficult, and it could change in the future. And
splitting the statistics data into many files (per table, for example)
would cause another performance issue when collecting/showing
statistics at once. Just my guess though.

So, I'm looking for a new way to reduce I/O for the statistics data in
general.

Regards,

If the latter, how would using a BTree help us? I would rather go for a
range tree or something. But again, I may be completely wrong. Please
elaborate a bit more on the solution we are trying to achieve. It seems
really interesting.

Regards,

Atri

--
Satoshi Nagayasu <sn...@uptime.jp>
Uptime Technologies, LLC. http://www.uptime.jp
Re: [HACKERS] Compression of full-page-writes
(2013/08/30 11:55), Fujii Masao wrote:

Hi,

The attached patch adds a new GUC parameter 'compress_backup_block'.
When this parameter is enabled, the server just compresses FPWs
(full-page writes) in WAL by using pglz_compress() before inserting
them into the WAL buffers. The compressed FPWs are then decompressed
during recovery. This is a very simple patch.

The purpose of this patch is the reduction of WAL size. Under heavy
write load, the server needs to write a large amount of WAL and this is
likely to be a bottleneck. What's worse, in replication, a large amount
of WAL has a harmful effect not only on WAL writing in the master, but
also on WAL streaming and WAL writing in the standby. Also, we would
need to spend more money on storage to store such large data. I'd like
to alleviate such harmful situations by reducing WAL size.

My idea is very simple: just compress FPWs, because FPWs are a big part
of WAL. I used pglz_compress() as the compression method, but you might
think that another method is better. We can add something like an
FPW-compression hook for that later.

The patch adds a new GUC parameter, but I'm thinking of merging it into
the full_page_writes parameter to avoid increasing the number of GUCs.
That is, I'm thinking of changing full_page_writes so that it can
accept a new value, 'compress'.

I measured how much WAL this patch can reduce, by using pgbench.

* Server spec
  CPU: 8core, Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz
  Mem: 16GB
  Disk: 500GB SSD Samsung 840

* Benchmark
  pgbench -c 32 -j 4 -T 900 -M prepared
  scaling factor: 100
  checkpoint_segments = 1024
  checkpoint_timeout = 5min
  (every checkpoint during the benchmark was triggered by
  checkpoint_timeout)

I believe that the amount of backup blocks in WAL files is affected by
how often checkpoints occur, particularly under such an
update-intensive workload. Under your configuration, checkpoints should
occur quite often. So, you need to increase checkpoint_timeout in order
to determine whether the patch is realistic.

Regards,

* Result
  [tps]
  1386.8 (compress_backup_block = off)
  1627.7 (compress_backup_block = on)

  [the amount of WAL generated during running pgbench]
  4302 MB (compress_backup_block = off)
  1521 MB (compress_backup_block = on)

At least in my test, the patch could reduce the WAL size to one-third!

The patch is still WIP. But I'd like to hear opinions about this idea
before completing it, and then add the patch to the next CF if okay.

Regards,

--
Satoshi Nagayasu <sn...@uptime.jp>
Uptime Technologies, LLC. http://www.uptime.jp
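The one-third reduction is plausible because full-page images are padded, zero-filled, and internally repetitive. pglz is internal to PostgreSQL, so the sketch below uses zlib as a stand-in, on a synthetic 8 kB "page" (a short header, a run of similar tuples, and a hole of zero bytes); the exact ratio is illustrative only:

```python
import zlib

PAGE = 8192

# Build something shaped like a heap page: a small header, a run of
# short similar tuples, and a large hole of zero bytes at the end.
tuples = b"".join(b"row-%06d|filler|" % i for i in range(150))
page = (b"\x01" * 24 + tuples).ljust(PAGE, b"\x00")

compressed = zlib.compress(page, 6)   # stand-in for pglz_compress()
print(len(compressed), f"{len(compressed) / PAGE:.2%} of the page")
```

Real pages with random, already-compressed, or TOASTed data will compress far less well, which is one reason the thread treats the one-third figure as workload-dependent.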
Re: [HACKERS] Compression of full-page-writes
(2013/08/30 12:07), Satoshi Nagayasu wrote: (2013/08/30 11:55), Fujii Masao wrote: Hi, Attached patch adds new GUC parameter 'compress_backup_block'. When this parameter is enabled, the server just compresses FPW (full-page-writes) in WAL by using pglz_compress() before inserting it to the WAL buffers. Then, the compressed FPW is decompressed in recovery. This is very simple patch. The purpose of this patch is the reduction of WAL size. Under heavy write load, the server needs to write a large amount of WAL and this is likely to be a bottleneck. What's the worse is, in replication, a large amount of WAL would have harmful effect on not only WAL writing in the master, but also WAL streaming and WAL writing in the standby. Also we would need to spend more money on the storage to store such a large data. I'd like to alleviate such harmful situations by reducing WAL size. My idea is very simple, just compress FPW because FPW is a big part of WAL. I used pglz_compress() as a compression method, but you might think that other method is better. We can add something like FPW-compression-hook for that later. The patch adds new GUC parameter, but I'm thinking to merge it to full_page_writes parameter to avoid increasing the number of GUC. That is, I'm thinking to change full_page_writes so that it can accept new value 'compress'. I measured how much WAL this patch can reduce, by using pgbench. * Server spec CPU: 8core, Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz Mem: 16GB Disk: 500GB SSD Samsung 840 * Benchmark pgbench -c 32 -j 4 -T 900 -M prepared scaling factor: 100 checkpoint_segments = 1024 checkpoint_timeout = 5min (every checkpoint during benchmark were triggered by checkpoint_timeout) I believe that the amount of backup blocks in WAL files is affected by how often the checkpoints are occurring, particularly under such update-intensive workload. Under your configuration, checkpoint should occur so often. 
So, you need to increase checkpoint_timeout in order to determine
whether the patch is realistic.

In fact, the following chart shows that checkpoint_timeout=30min also
reduces WAL size to one-third, compared with a 5min timeout, in a
pgbench experiment.

https://www.oss.ecl.ntt.co.jp/ossc/oss/img/pglesslog_img02.jpg

Regards,

Regards,

* Result
  [tps]
  1386.8 (compress_backup_block = off)
  1627.7 (compress_backup_block = on)

  [the amount of WAL generated during running pgbench]
  4302 MB (compress_backup_block = off)
  1521 MB (compress_backup_block = on)

At least in my test, the patch could reduce the WAL size to one-third!

The patch is still WIP. But I'd like to hear opinions about this idea
before completing it, and then add the patch to the next CF if okay.

Regards,

--
Satoshi Nagayasu <sn...@uptime.jp>
Uptime Technologies, LLC. http://www.uptime.jp
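The chart's behaviour follows from the FPW rule: a page produces a full-page image only on its first modification after each checkpoint, so lengthening checkpoint_timeout collapses repeated images of hot pages into one per cycle. A rough model under uniform random page updates (the working-set size and update rate below are invented numbers, not taken from the benchmark above):

```python
PAGE_KB = 8
N_PAGES = 100_000            # hot working set, in pages (illustrative)
RATE = 20_000                # page updates per minute (illustrative)

def fpw_mb_per_min(timeout_min):
    """FPW WAL volume per minute: each page dirtied in a checkpoint
    cycle contributes at most one full-page image per cycle."""
    updates = RATE * timeout_min
    # expected number of distinct pages touched by uniform-random updates
    distinct = N_PAGES * (1 - (1 - 1 / N_PAGES) ** updates)
    return distinct * PAGE_KB / 1024 / timeout_min

for t in (5, 30):
    print(f"checkpoint_timeout={t:2d}min -> "
          f"~{fpw_mb_per_min(t):6.1f} MB/min of FPW")
```

With these invented numbers the 5min-to-30min stretch cuts FPW volume by roughly a factor of four, the same order as the one-third reduction the chart reports.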
Re: [HACKERS] inconsistent state after crash recovery
(2013/08/02 21:19), Robert Haas wrote: On Fri, Aug 2, 2013 at 8:17 AM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: On Fri, Jul 26, 2013 at 8:27 AM, Tom Lane t...@sss.pgh.pa.us wrote: would you expect crash recovery to notice the disappearance of a file that was touched nowhere in the replayed actions? Eh, maybe not. But should we try harder to detect the unexpected disappearance of one that is? We do, don't we? The replay stuff should complain unless it sees a drop or truncate covering any unaccounted-for pages. Hmm. Yeah. But the OP seems to think it doesn't work. Yes. I'm afraid that. My attached script shows that crash recovery re-creates the lost table file implicitly, and fills some of those blocks (maybe lower ones) with zero without any notice. We can easily observe it by using pg_filedump. Thus, the table file can lose records, but DBA cannot recognize it because no message is left in the server log. I agree that this is not a PostgreSQL bug. However, DBA still needs to detect this table corruption, brought by several components which PostgreSQL relys on, to consider restoring from database backup. If PostgreSQL can detect and tell something about that, it would be really helpful for DBA to make some critical decision. I think PostgreSQL will be able to do that. Regards, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp #!/bin/sh PGHOME=/tmp/pgsql PGDATA=/tmp/pgsql/data PATH=${PGHOME}/bin:${PATH} export PGHOME PGDATA PATH createdb testdb psql -e testdbEOF drop table if exists t1; create table t1 ( c1 varchar(20) ); checkpoint; select count(*) from t1; select relname,relfilenode from pg_class where relname='t1'; select oid,* from pg_database where datname=current_database(); insert into t1 values (trim(to_char(generate_series(1,1000), '')) ); -- 1000 records select count(*) from t1; select pg_relation_size('t1'); -- write lower blocks in the table file. 
checkpoint;
-- write backup blocks and new records in WAL
insert into t1 values (trim(to_char(generate_series(1001,2000), ''))); -- 2000 records
select count(*) from t1;
select pg_relation_size('t1');
EOF

tablefile=`psql -A -t testdb <<EOF
select ( select setting from pg_settings where name='data_directory' )
    || '/' ||
       ( select pg_relation_filepath(oid) from pg_class where relname='t1' );
EOF
`
echo $tablefile

# force a process crash, then recovery after that.
killall -9 postgres

# remove the table file.
rm -rf $tablefile
rm -rf ${PGDATA}/*.pid
rm -rf ${PGDATA}/pg_log/*

pg_ctl -w -D ${PGDATA} start
cat ${PGDATA}/pg_log/*

psql -e testdb <<EOF
-- only a backup block and following wal records can be
-- recovered here, so the table is going to be inconsistent.
-- 1000+ records (inconsistent)
select count(*) from t1;
select pg_relation_size('t1');
EOF

#!/bin/sh

PGHOME=/tmp/pgsql
PGDATA=/tmp/pgsql/data
PATH=${PGHOME}/bin:${PATH}
export PGHOME PGDATA PATH

createdb testdb

psql -e testdb <<EOF
drop table if exists t1;
create table t1 ( c1 varchar(20) );
checkpoint;
select count(*) from t1;
select relname,relfilenode from pg_class where relname='t1';
select oid,* from pg_database where datname=current_database();
insert into t1 values (trim(to_char(generate_series(1,1000), '')));
select count(*) from t1;
select pg_relation_size('t1');
-- write lower blocks in the table file.
checkpoint;
-- write backup blocks and new records in WAL
insert into t1 values (trim(to_char(generate_series(1001,2000), '')));
select count(*) from t1;
select pg_relation_size('t1');
EOF

tablefile=`psql -A -t testdb <<EOF
select ( select setting from pg_settings where name='data_directory' )
    || '/' ||
       ( select pg_relation_filepath(oid) from pg_class where relname='t1' );
EOF
`
echo $tablefile

# force a process crash, then recovery after that.
killall -9 postgres

ls -l $tablefile
pg_filedump -y -S 8192 $tablefile > /tmp/dump_before_drop

# remove the table file.
rm -rf $tablefile
rm -rf ${PGDATA}/*.pid
rm -rf ${PGDATA}/pg_log/*

pg_ctl -w -D ${PGDATA} start -o "-p 5433"
cat ${PGDATA}/pg_log/*

psql -e testdb <<EOF
-- if we execute a checkpoint here, the table would be recovered
-- in an inconsistent state after crash recovery,
-- because only the upper blocks can be recovered.
select count(*) from t1;
select pg_relation_size('t1');
EOF

tablefile=`psql -A -t testdb <<EOF
select ( select setting from pg_settings where name='data_directory' )
    || '/' ||
       ( select pg_relation_filepath(oid) from pg_class where relname='t1' );
EOF
`
ls -l $tablefile
pg_filedump -y -S 8192 $tablefile > /tmp/dump_after_recovery

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
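As a side note on the mechanism: the silent zero-filling reported above is ordinary POSIX sparse-file behavior, not anything PostgreSQL does deliberately. When redo restores a backup block at some offset into a file that no longer exists, the file is re-created, and every skipped-over block reads back as zeros, with no error from either the kernel or the server. A PostgreSQL-free sketch of just that filesystem behavior (the block size and the relfilenode-style filename are arbitrary choices for illustration):

```python
import os
import tempfile

BLOCK_SIZE = 8192

def replay_full_page_write(path: str, blkno: int, page: bytes) -> None:
    """Mimic redo restoring a backup block: open the file (creating it
    if it has vanished), seek to the block's offset, write the image."""
    fd = os.open(path, os.O_CREAT | os.O_WRONLY)
    try:
        os.lseek(fd, blkno * BLOCK_SIZE, os.SEEK_SET)
        os.write(fd, page)
    finally:
        os.close(fd)

with tempfile.TemporaryDirectory() as tmpdir:
    # The table file was deleted before the crash; redo only has a
    # full-page image for block 6, so it re-creates the file and writes
    # that single block.
    path = os.path.join(tmpdir, "16397")   # arbitrary relfilenode-like name
    replay_full_page_write(path, 6, b"\xaa" * BLOCK_SIZE)

    with open(path, "rb") as f:
        data = f.read()

    # Blocks 0-5 were never written, yet they read back as zero pages --
    # exactly the "filled with zeros without any notice" symptom.
    assert len(data) == 7 * BLOCK_SIZE
    assert data[:6 * BLOCK_SIZE] == bytes(6 * BLOCK_SIZE)
    assert data[6 * BLOCK_SIZE:] == b"\xaa" * BLOCK_SIZE
```

Detecting this after the fact is hard precisely because nothing fails along the way: the only hint is that pages which should contain tuples are all-zero, which is also what pg_filedump shows for the lower blocks.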
Re: [HACKERS] [9.4 CF 1]Commitfest ... over!
(2013/08/03 8:47), Josh Berkus wrote:
> Given that we can expect to be dealing with more patches per CF in the
> future, I'd like some feedback about what things would make the CF
> process more efficient. For that matter, for the first time we tried
> enforcing some of the rules of CFs this time, and I'd like to hear if
> people think that helped.

The 5-day rule (and the notifications from the CFM) seemed to be working for me. It helped me focus on the patch review.

Regards,
-- Satoshi Nagayasu <sn...@uptime.jp>
Uptime Technologies, LLC. http://www.uptime.jp
[HACKERS] inconsistent state after crash recovery
Hi,

I received a question about inconsistent state after crash recovery. When a table file is broken (or just lost), PostgreSQL cannot recover the whole table, and shows no notice while recovering. I think this leaves the database in an inconsistent state.

(1) create a table, and fill it with records.
(2) process a checkpoint.
(3) fill in more records.
(4) force a crash, and delete the table file.
(5) run recovery on restarting.
(6) only records after the checkpoint can be recovered.

For example, the attached log shows that PostgreSQL recovers only 1058 records in a table which contained 2000 records before the crash, and says nothing about it in the server log.

--------------------------------------------------
insert into t1 values (trim(to_char(generate_series(1,1000), '')));
INSERT 0 1000
select count(*) from t1;
 count
-------
  1000
(1 row)

checkpoint;
CHECKPOINT
insert into t1 values (trim(to_char(generate_series(1001,2000), '')));
INSERT 0 1000
select count(*) from t1;
 count
-------
  2000
(1 row)

(delete the table file)
(kill postgres)
(restart postgres with recovery)

select count(*) from t1;
 count
-------
  1058
(1 row)
--------------------------------------------------

Is this expected or acceptable? I think, at the least, PostgreSQL should say something about this situation in the server log, because the DBA cannot recognize it if nothing is in the server log.

To reproduce it, please check the attached test script.

Any comments?

Regards,
-- Satoshi Nagayasu <sn...@uptime.jp>
Uptime Technologies, LLC.
http://www.uptime.jp [snaga@devsv03 postgresql.git]$ sh test_recovery.sh createdb: database creation failed: ERROR: database testdb already exists drop table if exists t1; DROP TABLE create table t1 ( c1 varchar(20) ); CREATE TABLE checkpoint; CHECKPOINT select count(*) from t1; count --- 0 (1 row) select relname,relfilenode from pg_class where relname='t1'; relname | relfilenode -+- t1 | 147456 (1 row) select oid,* from pg_database where datname=current_database(); oid | datname | datdba | encoding | datcollate | datctype | datistemplate | datallowconn | datconnlimit | datlastsysoid | datfrozenxid | datminmxid | dattablespace | datacl ---+-++--++--+---+--+--+---+--++---+ 16384 | testdb | 10 |0 | C | C| f | t| -1 | 12895 | 1799 | 1 | 1663 | (1 row) insert into t1 values (trim(to_char(generate_series(1,1000), '')) ); INSERT 0 1000 select count(*) from t1; count --- 1000 (1 row) select pg_relation_size('t1'); pg_relation_size -- 57344 (1 row) checkpoint; CHECKPOINT insert into t1 values (trim(to_char(generate_series(1001,2000), '')) ); INSERT 0 1000 select count(*) from t1; count --- 2000 (1 row) select pg_relation_size('t1'); pg_relation_size -- 106496 (1 row) /tmp/pgsql/data/base/16384/147456 waiting for server to start2013-07-26 13:33:03 JST [@ 14651] LOG: loaded library pg_stat_statements done server started 2013-07-26 13:33:03 JST [@ 14653] LOG: database system was interrupted; last known up at 2013-07-26 13:33:03 JST 2013-07-26 13:33:03 JST [@ 14653] LOG: database system was not properly shut down; automatic recovery in progress 2013-07-26 13:33:03 JST [@ 14653] LOG: redo starts at 0/1D493A0 2013-07-26 13:33:03 JST [@ 14653] LOG: record with zero length at 0/1D5D958 2013-07-26 13:33:03 JST [@ 14653] LOG: redo done at 0/1D5D928 2013-07-26 13:33:03 JST [@ 14653] LOG: last completed transaction was at log time 2013-07-26 13:33:03.158161+09 2013-07-26 13:33:03 JST [@ 14651] LOG: database system is ready to accept connections 2013-07-26 13:33:03 JST [@ 14657] LOG: 
autovacuum launcher started select count(*) from t1; count --- 1058 (1 row) select pg_relation_size('t1'); pg_relation_size -- 106496 (1 row) createdb: database creation failed: ERROR: database testdb already exists drop table if exists t1; DROP TABLE create table t1 ( c1 varchar(20) ); CREATE TABLE checkpoint; CHECKPOINT select count(*) from t1; count --- 0 (1 row) select relname,relfilenode from pg_class where relname='t1'; relname | relfilenode -+- t1 | 155648 (1 row) select oid,* from pg_database where datname=current_database(); oid | datname | datdba | encoding | datcollate | datctype | datistemplate | datallowconn | datconnlimit | datlastsysoid | datfrozenxid | datminmxid | dattablespace | datacl ---+-++--++--+---+--+--+---+--++---+ 16384
[HACKERS] [PoC] pgstattuple2: block sampling to reduce physical read
Hi,

I've been working on a new pgstattuple function that uses block sampling [1] in order to reduce block reads while scanning a table. A PoC patch is attached.

[1] Re: [RFC] pgstattuple/pgstatindex enhancement
http://www.postgresql.org/message-id/ca+tgmoaxjhgz2c4ayfbr9muuvnhgwu4co-cthqpzrwwdtam...@mail.gmail.com

This new function, pgstattuple2(), randomly samples only 3,000 blocks (which amounts to 24MB) from the table, and estimates several parameters of the entire table. The function calculates the averages of the samples, estimates the parameters (averages and SDs), and shows standard errors (SE) to allow estimating the status of the table with a statistical approach. And, of course, it reduces the number of physical block reads while scanning a bigger table.

The following example shows that the new pgstattuple2 function runs about 100 times faster than the original pgstattuple function, with well-estimated results.

--------------------------------------------------
postgres=# select * from pgstattuple('pgbench_accounts');
-[ RECORD 1 ]------+-----------
table_len          | 1402642432
tuple_count        | 1000
tuple_len          | 121000
tuple_percent      | 86.27
dead_tuple_count   | 182895
dead_tuple_len     | 22130295
dead_tuple_percent | 1.58
free_space         | 21012328
free_percent       | 1.5

Time: 1615.651 ms

postgres=# select * from pgstattuple2('pgbench_accounts');
NOTICE: pgstattuple2: SE tuple_count 2376.47, tuple_len 287552.58, dead_tuple_count 497.63, dead_tuple_len 60213.08, free_space 289752.38
-[ RECORD 1 ]------+-----------
table_len          | 1402642432
tuple_count        | 9978074
tuple_len          | 1207347074
tuple_percent      | 86.08
dead_tuple_count   | 187315
dead_tuple_len     | 22665208
dead_tuple_percent | 1.62
free_space         | 23400431
free_percent       | 1.67

Time: 15.026 ms
postgres=#
--------------------------------------------------

In addition, see the attached chart to see how well pgstattuple2 estimates during a repeated (long-running) pgbench run. I understand that pgbench generates random transactions, and those update operations might not have any skew over the table, so estimating the table status seems to be easy in this test.
However, I'm still curious to know whether it would work on a real-world workload.

Is it worth having this? Any comment or suggestion?

Regards,
-- Satoshi Nagayasu <sn...@uptime.jp>
Uptime Technologies, LLC. http://www.uptime.jp

diff --git a/contrib/pgstattuple/pgstattuple--1.1--1.2.sql b/contrib/pgstattuple/pgstattuple--1.1--1.2.sql
index 2783a63..8ebec6f 100644
--- a/contrib/pgstattuple/pgstattuple--1.1--1.2.sql
+++ b/contrib/pgstattuple/pgstattuple--1.1--1.2.sql
@@ -37,3 +37,17 @@ CREATE FUNCTION pg_relpages(IN relname regclass)
 RETURNS BIGINT
 AS 'MODULE_PATHNAME', 'pg_relpagesbyid'
 LANGUAGE C STRICT;
+
+CREATE FUNCTION pgstattuple2(IN relname regclass,
+OUT table_len BIGINT,          -- physical table length in bytes
+OUT tuple_count BIGINT,        -- number of live tuples
+OUT tuple_len BIGINT,          -- total tuples length in bytes
+OUT tuple_percent FLOAT8,      -- live tuples in %
+OUT dead_tuple_count BIGINT,   -- number of dead tuples
+OUT dead_tuple_len BIGINT,     -- total dead tuples length in bytes
+OUT dead_tuple_percent FLOAT8, -- dead tuples in %
+OUT free_space BIGINT,         -- free space in bytes
+OUT free_percent FLOAT8)       -- free space in %
+AS 'MODULE_PATHNAME', 'pgstattuple2'
+LANGUAGE C STRICT;
+
diff --git a/contrib/pgstattuple/pgstattuple--1.2.sql b/contrib/pgstattuple/pgstattuple--1.2.sql
index e5fa2f5..963eb00 100644
--- a/contrib/pgstattuple/pgstattuple--1.2.sql
+++ b/contrib/pgstattuple/pgstattuple--1.2.sql
@@ -77,3 +77,17 @@ CREATE FUNCTION pg_relpages(IN relname regclass)
 RETURNS BIGINT
 AS 'MODULE_PATHNAME', 'pg_relpagesbyid'
 LANGUAGE C STRICT;
+
+CREATE FUNCTION pgstattuple2(IN relname regclass,
+OUT table_len BIGINT,          -- physical table length in bytes
+OUT tuple_count BIGINT,        -- number of live tuples
+OUT tuple_len BIGINT,          -- total tuples length in bytes
+OUT tuple_percent FLOAT8,      -- live tuples in %
+OUT dead_tuple_count BIGINT,   -- number of dead tuples
+OUT dead_tuple_len BIGINT,     -- total dead tuples length in bytes
+OUT dead_tuple_percent FLOAT8, -- dead tuples in %
+OUT free_space BIGINT,         -- free space in bytes
+OUT free_percent FLOAT8)       -- free space in %
+AS 'MODULE_PATHNAME', 'pgstattuple2'
+LANGUAGE C STRICT;
+
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index 7f41ec3..c2adb75 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -41,9 +41,22 @@ PG_MODULE_MAGIC;
 PG_FUNCTION_INFO_V1(pgstattuple);
 PG_FUNCTION_INFO_V1(pgstattuplebyid);
+PG_FUNCTION_INFO_V1(pgstattuple2);
 extern Datum pgstattuple(PG_FUNCTION_ARGS);
 extern
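The arithmetic behind the reported SE values is standard simple-random-sampling estimation: take per-block sample counts, scale the sample mean by the total number of blocks to get a table-wide total, and report SE = (sample SD / sqrt(n)) x nblocks. A rough model of that estimation in Python (the function and variable names here are illustrative, not taken from the PoC patch; the real code works on live/dead tuple counts and free space per sampled page):

```python
import math
import random

SAMPLE_BLOCKS = 3000  # pgstattuple2 samples up to 3,000 blocks

def estimate_total(per_block, nblocks, nsamples=SAMPLE_BLOCKS, seed=42):
    """Estimate a table-wide total (e.g. tuple_count) and its standard
    error from a random sample of per-block counts."""
    rng = random.Random(seed)
    n = min(nsamples, len(per_block))
    sample = rng.sample(per_block, n)
    mean = sum(sample) / n
    # Sample variance (n-1 denominator), then SE of the estimated total.
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    total = mean * nblocks
    se = math.sqrt(var / n) * nblocks
    return total, se

# A toy 20,000-block table with exactly 61 tuples in every block: the
# estimate is exact and the standard error collapses to zero.
blocks = [61] * 20_000
total, se = estimate_total(blocks, len(blocks))
assert total == 61 * 20_000
assert se == 0.0
```

With skewed tables (e.g. bloat concentrated at one end of the heap) the SE grows, which is exactly the signal the NOTICE line in the example above is meant to convey.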
Re: [HACKERS] New statistics for WAL buffer dirty writes
Will revise and re-resubmit for the next CF.

Regards,

2013/07/19 1:06, Alvaro Herrera wrote:
> What happened to this patch? We were waiting on an updated version from
> you.
>
> Satoshi Nagayasu wrote:
>> (2012/12/10 3:06), Tomas Vondra wrote:
>>> On 29.10.2012 04:58, Satoshi Nagayasu wrote:
>>>> 2012/10/24 1:12, Alvaro Herrera wrote:
>>>>> Satoshi Nagayasu escribió:
>>>>>> With this patch, the walwriter process and each backend process
>>>>>> would sum up dirty writes, and send them to the stat collector. So,
>>>>>> the value could be saved in the stat file, and could be kept across
>>>>>> restarts. The statistics could be retrieved using the
>>>>>> pg_stat_get_xlog_dirty_writes() function, and could be reset by
>>>>>> calling pg_stat_reset_shared('walwriter').
>>>>>>
>>>>>> Now, I have one concern. The reset time could be captured in
>>>>>> globalStats.stat_reset_timestamp, but this value is the same as the
>>>>>> bgwriter one. So, once pg_stat_reset_shared('walwriter') is called,
>>>>>> the stats_reset column in pg_stat_bgwriter represents the reset time
>>>>>> for the walwriter, not for the bgwriter.
>>>>>>
>>>>>> How should we handle this? Should we split this value? And should we
>>>>>> have a new system view for the walwriter?
>>>>>
>>>>> I think the answer to the two last questions is yes. It doesn't seem
>>>>> to make sense, to me, to have a single reset timing for what are
>>>>> effectively two separate things.
>>>>>
>>>>> Please submit an updated patch to the next CF. I'm marking this one
>>>>> returned with feedback.
>>>>
>>>> Thanks. I attached the latest one, which splits the reset_time for
>>>> bgwriter and walwriter, and provides a new system view, called
>>>> pg_stat_walwriter, to show the dirty write counter and the reset time.
>>>
>>> I've done a quick review of the v4 patch:
>>
>> Thanks for the review, and sorry for my delayed response.
>>
>>> 1) applies fine on HEAD, compiles fine
>>>
>>> 2) make installcheck fails because of a difference in the 'rules' test
>>> suite (there's a new view pg_stat_walwriter - see the attached patch
>>> for a fixed version of expected/rules.out)
>>
>> Ah, I forgot about the regression test. I will fix it. Thanks.
>>> 3) I do agree with Alvaro that using the same struct for two separate
>>> components (bgwriter and walwriter) seems a bit awkward. For example
>>> you need to have two separate stat_reset fields, the reset code
>>> becomes much more verbose (because you need to list individual fields)
>>> etc. So I'd vote to either split this into two structures or keep it
>>> as a single structure (although with two views on top of it).
>>
>> Ok, I will split it into two structs, PgStat_BgWriterGlobalStats and
>> PgStat_WalWriterGlobalStats, and will modify PgStat_GlobalStats to hold
>> those two structs in the stat collector.
>>
>>> 4) Are there any other fields that might be interesting? Right now
>>> there's just dirty_writes but I guess there are other values. E.g. how
>>> much data was actually written etc.?
>>
>> AFAIK, those numbers can be obtained by calling
>> pg_current_xlog_insert_location() or pg_current_xlog_location(), but if
>> we need it, I will add it.

Regards,
-- Satoshi Nagayasu <sn...@uptime.jp>
Uptime Technologies, LLC. http://www.uptime.jp
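The gain from splitting the struct is that each component then carries its own reset timestamp, so pg_stat_reset_shared('walwriter') no longer clobbers the stats_reset shown in pg_stat_bgwriter. A toy model of that layout (all names here are invented for illustration; the real structs are C structs in the stats collector, not Python objects):

```python
import time
from dataclasses import dataclass, field

@dataclass
class ComponentStats:
    """Per-component counters plus the component's own reset timestamp."""
    counters: dict = field(default_factory=dict)
    stat_reset_timestamp: float = field(default_factory=time.time)

    def reset(self) -> None:
        self.counters.clear()
        self.stat_reset_timestamp = time.time()

@dataclass
class GlobalStats:
    # One sub-struct per component, instead of one shared blob.
    bgwriter: ComponentStats = field(default_factory=ComponentStats)
    walwriter: ComponentStats = field(default_factory=ComponentStats)

stats = GlobalStats()
stats.walwriter.counters["xlog_dirty_writes"] = 42
stats.bgwriter.counters["buffers_clean"] = 7

bgwriter_reset_before = stats.bgwriter.stat_reset_timestamp
stats.walwriter.reset()   # models pg_stat_reset_shared('walwriter')

assert stats.walwriter.counters == {}
assert stats.bgwriter.counters == {"buffers_clean": 7}
# The bgwriter's reset time is untouched by the walwriter reset.
assert stats.bgwriter.stat_reset_timestamp == bgwriter_reset_before
```

Each system view (pg_stat_bgwriter, pg_stat_walwriter) then simply exposes its own sub-struct, including its own stats_reset column.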
Re: [HACKERS] Fix pgstattuple/pgstatindex to use regclass-type as the argument
(2013/07/18 2:31), Fujii Masao wrote: On Tue, Jul 16, 2013 at 3:00 PM, Satoshi Nagayasu sn...@uptime.jp wrote: (2013/07/04 3:58), Fujii Masao wrote: For the test, I just implemented the regclass-version of pg_relpages() (patch attached) and tested some cases. But I could not get that problem. SELECT pg_relpages('hoge');-- OK SELECT pg_relpages(oid) FROM pg_class WHERE relname = 'hoge';-- OK SELECT pg_relpages(relname) FROM pg_class WHERE relname = 'hoge';-- OK In the attached patch, I cleaned up three functions to have two types of arguments for each, text and regclass. pgstattuple(text) pgstattuple(regclass) pgstatindex(text) pgstatindex(regclass) pg_relpages(text) pg_relpages(regclass) I still think a regclass argument is more appropriate for passing relation/index name to a function than text-type, but having both arguments in each function seems to be a good choice at this moment, in terms of backward-compatibility. Docs needs to be updated if this change going to be applied. Yes, please. Updated docs and code comments, etc. PFA. Any comments? 'make installcheck' failed in my machine. Hmm, it works on my box... Do we need to remove pgstattuple--1.1.sql and create pgstattuple--1.1--1.2.sql? +/* contrib/pgstattuple/pgstattuple--1.1.sql */ Typo: s/1.1/1.2 Done. You seem to have forgotten to update pgstattuple.c. Should I change something in pgstattuple.c? I just changed CREATE FUNCTION statement for pgstattuple to replace oid input arg with the regclass. Regards, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. 
http://www.uptime.jp diff --git a/contrib/pgstattuple/Makefile b/contrib/pgstattuple/Makefile index fc893d8..957742a 100644 --- a/contrib/pgstattuple/Makefile +++ b/contrib/pgstattuple/Makefile @@ -4,7 +4,7 @@ MODULE_big = pgstattuple OBJS = pgstattuple.o pgstatindex.o EXTENSION = pgstattuple -DATA = pgstattuple--1.1.sql pgstattuple--1.0--1.1.sql pgstattuple--unpackaged--1.0.sql +DATA = pgstattuple--1.2.sql pgstattuple--1.0--1.1.sql pgstattuple--unpackaged--1.0.sql REGRESS = pgstattuple diff --git a/contrib/pgstattuple/expected/pgstattuple.out b/contrib/pgstattuple/expected/pgstattuple.out index ab28f50..eaba306 100644 --- a/contrib/pgstattuple/expected/pgstattuple.out +++ b/contrib/pgstattuple/expected/pgstattuple.out @@ -11,12 +11,24 @@ select * from pgstattuple('test'::text); 0 | 0 | 0 | 0 |0 | 0 | 0 | 0 |0 (1 row) +select * from pgstattuple('test'::name); + table_len | tuple_count | tuple_len | tuple_percent | dead_tuple_count | dead_tuple_len | dead_tuple_percent | free_space | free_percent +---+-+---+---+--++++-- + 0 | 0 | 0 | 0 |0 | 0 | 0 | 0 |0 +(1 row) + select * from pgstattuple('test'::regclass); table_len | tuple_count | tuple_len | tuple_percent | dead_tuple_count | dead_tuple_len | dead_tuple_percent | free_space | free_percent ---+-+---+---+--++++-- 0 | 0 | 0 | 0 |0 | 0 | 0 | 0 |0 (1 row) +select * from pgstattuple('test'::regclass::oid); + table_len | tuple_count | tuple_len | tuple_percent | dead_tuple_count | dead_tuple_len | dead_tuple_percent | free_space | free_percent +---+-+---+---+--++++-- + 0 | 0 | 0 | 0 |0 | 0 | 0 | 0 |0 +(1 row) + select * from pgstatindex('test_pkey'); version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation -+++---+++-+---+--+ diff --git a/contrib/pgstattuple/pgstatindex.c b/contrib/pgstattuple/pgstatindex.c index 97f897e..41e90e3 100644 --- a/contrib/pgstattuple/pgstatindex.c +++ b/contrib/pgstattuple/pgstatindex.c @@ -39,12 
+39,24 @@
 #include "utils/rel.h"
+/*
+ * Because of a backward-compatibility issue, we have decided to have
+ * two types of interfaces, with regclass-type input arg and text-type
+ * input arg, for each function.
+ *
+ * Those functions which have a text-type input arg will be deprecated
+ * in a future release.
+ */
 extern Datum pgstatindex(PG_FUNCTION_ARGS);
+extern Datum pgstatindexbyid
Re: [HACKERS] Fix pgstattuple/pgstatindex to use regclass-type as the argument
(2013/07/04 3:58), Fujii Masao wrote:
> On Wed, Jun 26, 2013 at 12:39 AM, Robert Haas <robertmh...@gmail.com> wrote:
>> On Thu, Jun 20, 2013 at 2:32 PM, Fujii Masao <masao.fu...@gmail.com> wrote:
>>> Since pg_relpages(oid) doesn't exist, pg_relpages() is in the same
>>> situation as pgstatindex(), i.e., we cannot just replace
>>> pg_relpages(text) with pg_relpages(regclass) for backward-compatibility.
>>> How do you think we should solve the pg_relpages() problem? Rename?
>>> Just add pg_relpages(regclass)?
>>
>> Adding a function with a new name seems likely to be smoother, since
>> that way you don't have to worry about problems with function calls
>> being thought ambiguous.
>
> Could you show me an example where this problem happens? For the test,
> I just implemented the regclass version of pg_relpages() (patch
> attached) and tested some cases. But I could not reproduce that problem.
>
> SELECT pg_relpages('hoge');    -- OK
> SELECT pg_relpages(oid) FROM pg_class WHERE relname = 'hoge';    -- OK
> SELECT pg_relpages(relname) FROM pg_class WHERE relname = 'hoge';    -- OK

In the attached patch, I cleaned up three functions to have two types of arguments each, text and regclass:

pgstattuple(text)
pgstattuple(regclass)
pgstatindex(text)
pgstatindex(regclass)
pg_relpages(text)
pg_relpages(regclass)

I still think a regclass argument is more appropriate than text for passing a relation/index name to a function, but having both argument types for each function seems to be a good choice at the moment, in terms of backward-compatibility.

Docs need to be updated if this change is going to be applied.

Any comments?

-- Satoshi Nagayasu <sn...@uptime.jp>
Uptime Technologies, LLC.
http://www.uptime.jp diff --git a/contrib/pgstattuple/Makefile b/contrib/pgstattuple/Makefile index fc893d8..957742a 100644 --- a/contrib/pgstattuple/Makefile +++ b/contrib/pgstattuple/Makefile @@ -4,7 +4,7 @@ MODULE_big = pgstattuple OBJS = pgstattuple.o pgstatindex.o EXTENSION = pgstattuple -DATA = pgstattuple--1.1.sql pgstattuple--1.0--1.1.sql pgstattuple--unpackaged--1.0.sql +DATA = pgstattuple--1.2.sql pgstattuple--1.0--1.1.sql pgstattuple--unpackaged--1.0.sql REGRESS = pgstattuple diff --git a/contrib/pgstattuple/expected/pgstattuple.out b/contrib/pgstattuple/expected/pgstattuple.out index ab28f50..eaba306 100644 --- a/contrib/pgstattuple/expected/pgstattuple.out +++ b/contrib/pgstattuple/expected/pgstattuple.out @@ -11,12 +11,24 @@ select * from pgstattuple('test'::text); 0 | 0 | 0 | 0 |0 | 0 | 0 | 0 |0 (1 row) +select * from pgstattuple('test'::name); + table_len | tuple_count | tuple_len | tuple_percent | dead_tuple_count | dead_tuple_len | dead_tuple_percent | free_space | free_percent +---+-+---+---+--++++-- + 0 | 0 | 0 | 0 |0 | 0 | 0 | 0 |0 +(1 row) + select * from pgstattuple('test'::regclass); table_len | tuple_count | tuple_len | tuple_percent | dead_tuple_count | dead_tuple_len | dead_tuple_percent | free_space | free_percent ---+-+---+---+--++++-- 0 | 0 | 0 | 0 |0 | 0 | 0 | 0 |0 (1 row) +select * from pgstattuple('test'::regclass::oid); + table_len | tuple_count | tuple_len | tuple_percent | dead_tuple_count | dead_tuple_len | dead_tuple_percent | free_space | free_percent +---+-+---+---+--++++-- + 0 | 0 | 0 | 0 |0 | 0 | 0 | 0 |0 +(1 row) + select * from pgstatindex('test_pkey'); version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation -+++---+++-+---+--+ diff --git a/contrib/pgstattuple/pgstatindex.c b/contrib/pgstattuple/pgstatindex.c index 97f897e..9ec74e7 100644 --- a/contrib/pgstattuple/pgstatindex.c +++ b/contrib/pgstattuple/pgstatindex.c @@ -40,11 
+40,15 @@ extern Datum pgstatindex(PG_FUNCTION_ARGS); +extern Datum pgstatindexbyid(PG_FUNCTION_ARGS); extern Datum pg_relpages(PG_FUNCTION_ARGS); +extern Datum pg_relpagesbyid(PG_FUNCTION_ARGS); extern Datum pgstatginindex(PG_FUNCTION_ARGS); PG_FUNCTION_INFO_V1(pgstatindex); +PG_FUNCTION_INFO_V1(pgstatindexbyid
Re: [HACKERS] Fix pgstattuple/pgstatindex to use regclass-type as the argument
Hi Rushabh,

(2013/07/16 14:58), Rushabh Lathia wrote:
> Hello Satoshi,
>
> I assigned myself as the reviewer of this patch. The issue status is
> "waiting on author".

Thank you for picking it up.

> Now looking at the discussion under the thread it seems like we are
> waiting for a suggestion for the new function name, right?

Yes.

> I am wondering why we actually need a new name. Can't we just overload
> the same function and provide two versions of it?

I think the major reason is to avoid some confusion between the old and new function arguments. My thought here is that having both argument types (text and regclass) for each function is a good choice to clean up the interfaces while keeping backward-compatibility.

> In the last thread Fujii just did the same for pg_relpages and it seems
> like a good-to-go approach, isn't it? Am I missing anything here?

I just posted a revised patch to handle the issue in the three functions of the pgstattuple module. Please take a look.

Regards,
-- Satoshi Nagayasu <sn...@uptime.jp>
Uptime Technologies, LLC. http://www.uptime.jp
Re: [HACKERS] Add visibility map information to pg_freespace.
(2013/07/09 19:55), Kyotaro HORIGUCHI wrote:
> Hello,
>
> I've brought visibility map extensions for pg_freespacemap and
> pgstattuple.
>
> At Mon, 08 Jul 2013 16:59:05 +0900 (Tokyo Standard Time), Kyotaro
> HORIGUCHI <horiguchi.kyot...@lab.ntt.co.jp> wrote in
> 20130708.165905.118860769.horiguchi.kyot...@lab.ntt.co.jp
>> I'll come again with the first implementation of it. And as for
>> pg_freespacemap, I'll keep the current direction - adding a column to
>> the present output record format of pg_freespace(). And documentation,
>> if possible.
>
> pg_freespace_vm_v2.patch:
>
> The interface has been changed from the first patch. The version of
> pg_freespace() provided with vm information is named
> pg_freespace_with_vminfo() and shows output like the following.
>
> | postgres=# select * from pg_freespace_with_vminfo('t'::regclass) limit 10;
> |  blkno | avail | is_all_visible
> | -------+-------+----------------
> |      0 |    64 | t
> |      1 |    32 | t
> |      2 |    96 | t
> |      3 |    64 | t
> |      4 |    96 | t
> |      5 |    96 | t
> |      6 |   128 | t
> |      7 |    32 | t
> |      8 |    96 | t

I think we can simply add an is_all_visible column to the existing pg_freespace(), because adding a column would not break backward-compatibility in general. Any other thoughts?

> pgstattuple_vm_v1.patch:
>
> The first version of the VM extension for pgstattuple. According to the
> previous discussion, the added column is named 'all_visible_percent'.
>
> | postgres=# select * from pgstattuple('t');
> | -[ RECORD 1 ]-------+---------
> | table_len           | 71770112
> | tuple_count         | 989859
> | tuple_len           | 31675488
> | tuple_percent       | 44.13
> | dead_tuple_count    | 99
> | dead_tuple_len      | 3168
> | dead_tuple_percent  | 0
> | free_space          | 31886052
> | free_percent        | 44.43
> | all_visible_percent | 99.98

It seems to be working fine. And I added a regression test for pg_freespacemap and additional test cases for pgstattuple. Please take a look.

Regards,
-- Satoshi Nagayasu <sn...@uptime.jp>
Uptime Technologies, LLC.
http://www.uptime.jp diff --git a/contrib/pg_freespacemap/Makefile b/contrib/pg_freespacemap/Makefile index d794df2..09d6ff8 100644 --- a/contrib/pg_freespacemap/Makefile +++ b/contrib/pg_freespacemap/Makefile @@ -6,6 +6,8 @@ OBJS = pg_freespacemap.o EXTENSION = pg_freespacemap DATA = pg_freespacemap--1.1.sql pg_freespacemap--1.0--1.1.sql pg_freespacemap--unpackaged--1.0.sql +REGRESS = pg_freespacemap + ifdef USE_PGXS PG_CONFIG = pg_config PGXS := $(shell $(PG_CONFIG) --pgxs) diff --git a/contrib/pg_freespacemap/expected/pg_freespacemap.out b/contrib/pg_freespacemap/expected/pg_freespacemap.out new file mode 100644 index 000..cde954d --- /dev/null +++ b/contrib/pg_freespacemap/expected/pg_freespacemap.out @@ -0,0 +1,100 @@ +create extension pg_freespacemap; +create table t1 ( uid integer primary key, uname text not null ); +select * from pg_freespace('t1'); + blkno | avail +---+--- +(0 rows) + +select * from pg_freespace('t1'::regclass); + blkno | avail +---+--- +(0 rows) + +select * from pg_freespace('t1', 1); + pg_freespace +-- +0 +(1 row) + +select * from pg_freespace_with_vminfo('t1'); + blkno | avail | is_all_visible +---+---+ +(0 rows) + +select * from pg_freespace_with_vminfo('t1'::regclass); + blkno | avail | is_all_visible +---+---+ +(0 rows) + +insert into t1 values ( 100, 'postgresql' ); +select * from pg_freespace('t1'); + blkno | avail +---+--- + 0 | 0 +(1 row) + +select * from pg_freespace('t1', 1); + pg_freespace +-- +0 +(1 row) + +select * from pg_freespace_with_vminfo('t1'); + blkno | avail | is_all_visible +---+---+ + 0 | 0 | f +(1 row) + +select * from pg_freespace('t1_pkey'); + blkno | avail +---+--- + 0 | 0 + 1 | 0 +(2 rows) + +select * from pg_freespace('t1_pkey', 1); + pg_freespace +-- +0 +(1 row) + +select * from pg_freespace('t1_pkey', 2); + pg_freespace +-- +0 +(1 row) + +select * from pg_freespace_with_vminfo('t1_pkey'); + blkno | avail | is_all_visible +---+---+ + 0 | 0 | f + 1 | 0 | f +(2 rows) + +vacuum t1; +select * from 
pg_freespace('t1'); + blkno | avail +---+--- + 0 | 8096 +(1 row) + +select * from pg_freespace_with_vminfo('t1'); + blkno | avail | is_all_visible +---+---+ + 0 | 8096 | t +(1 row) + +select * from pg_freespace('t1_pkey'); + blkno | avail +---+--- + 0 | 0 + 1 | 0 +(2 rows) + +select * from pg_freespace_with_vminfo('t1_pkey'); + blkno | avail | is_all_visible +---+---+ + 0 | 0 | f + 1 | 0 | f +(2 rows) + diff --git a/contrib/pg_freespacemap/sql/pg_freespacemap.sql b/contrib/pg_freespacemap/sql/pg_freespacemap.sql new file mode 100644 index 000..79a458d --- /dev/null +++ b/contrib
Re: [HACKERS] pg_filedump 9.3: checksums (and a few other fixes)
Hi,

I have reviewed this patch as a CF reviewer.

(2013/06/27 4:07), Jeff Davis wrote:
> On Mon, 2013-06-24 at 20:34 -0400, Josh Kupershmidt wrote:
>> This patch is in the current CommitFest, does it still need to be
>> reviewed? If so, I notice that the version in pgfoundry's CVS is rather
>> different than the version the patch seems to have been built against
>> (presumably the pg_filedump-9.2.0.tar.gz release), and conflicts in
>> several places with cvs tip.
>
> Rebased against CVS tip; attached.

It looks fine, but I have one question here.

When I run pg_filedump with -k against a database cluster which does not support checksums, pg_filedump produces a checksum error, as follows. Is this expected or acceptable?

-------------------------------------------------------
***********************************************************
* PostgreSQL File/Block Formatted Dump Utility - Version 9.3.0
*
* File: /tmp/pgsql/data/base/16384/16397
* Options used: -k
*
* Dump created on: Sat Jul 6 10:32:15 2013
***********************************************************

Block    0 ************************************************
<Header> -----
 Block Offset: 0x00000000         Offsets: Lower     268 (0x010c)
 Block: Size 8192  Version    4            Upper     384 (0x0180)
 LSN:  logid      0 recoff 0x00000000      Special  8192 (0x2000)
 Items:   61                      Free Space:  116
 Checksum: 0x0000  Prune XID: 0x00000000  Flags: 0x0000 ()
 Length (including item array): 268

 Error: checksum failure: calculated 0xf797.

<Data> ------
 Item   1 -- Length:  121  Offset: 8064 (0x1f80)  Flags: NORMAL
 Item   2 -- Length:  121  Offset: 7936 (0x1f00)  Flags: NORMAL
 Item   3 -- Length:  121  Offset: 7808 (0x1e80)  Flags: NORMAL
-------------------------------------------------------

Please check the attached script to reproduce it.

Also, I have updated the help message and README. Please check the attached patch.

Regards,
-- Satoshi Nagayasu <sn...@uptime.jp>
Uptime Technologies, LLC. http://www.uptime.jp

PGHOME=/tmp/pgsql
PGPORT=15433

function build_check {
    echo make PGSQL_INCLUDE_DIR=...
    make PGSQL_INCLUDE_DIR=../../src/include clean
    make PGSQL_INCLUDE_DIR=../../src/include all
    make PGSQL_INCLUDE_DIR=../../src/include install
    make PGSQL_INCLUDE_DIR=../../src/include clean

    echo make -f Makefile.contrib...
make -f Makefile.contrib clean make -f Makefile.contrib all make -f Makefile.contrib install make -f Makefile.contrib clean echo make -f Makefile.contrib USE_PGXS=1... PATH=${PGHOME}/bin:$PATH make -f Makefile.contrib USE_PGXS=1 clean make -f Makefile.contrib USE_PGXS=1 all make -f Makefile.contrib USE_PGXS=1 install make -f Makefile.contrib USE_PGXS=1 clean } function do_builddb { INITDB_OPTS=$1 killall -9 postmaster postgres rm -rf ${PGHOME}/data initdb ${INITDB_OPTS} --no-locale -D ${PGHOME}/data pg_ctl -w -D ${PGHOME}/data start -o -p ${PGPORT} createdb -p ${PGPORT} testdb pgbench -p ${PGPORT} -i testdb psql -A -t -p ${PGPORT} testdbEOF file select '${PGHOME}/data/' || pg_relation_filepath(oid) from pg_class where relname like 'pgbench%'; EOF } function builddb_checksum_enabled { do_builddb -k } function builddb_checksum_disabled { do_builddb } function test_not_verify_checksum { LOG=$1 sed 's/^/pg_filedump /' file _test.sh sh _test.sh $LOG } function test_verify_checksum { LOG=$1 sed 's/^/pg_filedump -k /' file _test.sh sh _test.sh $LOG } build_check builddb_checksum_enabled test_not_verify_checksum test_enabled_not_verify.log test_verify_checksum test_enabled_verify.log builddb_checksum_disabled test_not_verify_checksum test_disabled_not_verify.log test_verify_checksum test_disabled_verify.log diff --git a/contrib/pg_filedump/README.pg_filedump b/contrib/pg_filedump/README.pg_filedump index 63d086f..b3050cd 100644 --- a/contrib/pg_filedump/README.pg_filedump +++ b/contrib/pg_filedump/README.pg_filedump @@ -2,7 +2,7 @@ pg_filedump - Display formatted contents of a PostgreSQL heap, index, or control file. Copyright (c) 2002-2010 Red Hat, Inc. 
-Copyright (c) 2011-2012, PostgreSQL Global Development Group +Copyright (c) 2011-2013, PostgreSQL Global Development Group This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by @@ -59,10 +59,10 @@ not require any manual adjustments of the Makefile. Invocation: -pg_filedump [-abcdfhixy] [-R startblock [endblock]] [-S blocksize] file +pg_filedump [-abcdfhikxy] [-R startblock [endblock]] [-S blocksize] file -Defaults are: relative addressing, range of the entire file, block size - as listed on block 0 in the file +Defaults are: relative addressing, range of the entire file, block
Re: [HACKERS] Block write statistics WIP
Hi, I'm looking into this patch as a reviewer.

(2013/05/24 10:33), Heikki Linnakangas wrote: On 23.05.2013 19:10, Greg Smith wrote: On 5/20/13 7:51 AM, Heikki Linnakangas wrote: It strikes me as kind of weird that the read side and write side are not more symmetrical. It might seem weird if you expect the API to be similar to POSIX read() and write(), where you can read() an arbitrary block at any location, and write() an arbitrary block at any location. A better comparison would be e.g. open() and close(). open() returns a file descriptor, which you pass to other functions to operate on the file. When you're done, you call close(fd). The file descriptor is completely opaque to the user program; you do all the operations through the functions that take the fd as an argument. Similarly, ReadBuffer() returns a Buffer, which is completely opaque to the caller, and you do all the operations through various functions and macros that operate on the Buffer. When you're done, you release the buffer with ReleaseBuffer(). (Sorry for the digression from the original topic; I don't have any problem with adding an optional Relation argument to MarkBuffer if you need that :-) )

Or should we add some pointer, which is associated with the Relation, into BufferDesc? Maybe an OID?

I'm thinking of FlushBuffer() too. How can we count physical writes for each relation in FlushBuffer()? Or is there any other appropriate place?

BTW, Greg's first patch could not be applied to the latest HEAD. I guess it needs some fixes, though I couldn't clarify exactly what.

Regards, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fix pgstattuple/pgstatindex to use regclass-type as the argument
(2013/06/17 4:02), Fujii Masao wrote: On Sat, Mar 9, 2013 at 3:23 PM, Satoshi Nagayasu sn...@uptime.jp wrote: It is obviously easy to keep two types of function interfaces, one with regclass-type and another with text-type, in the next release for backward-compatibility like below: pgstattuple(regclass) -- safer interface. pgstattuple(text) -- will be deprecated in a future release. So you're thinking to remove pgstattuple(oid) soon?

AFAIK, a regclass-type argument would accept an OID value, which means the regclass type is upward-compatible with the oid type. So, even if the declaration is changed, compatibility could actually be kept. This test case (in sql/pgstattuple.sql) confirms that.

select * from pgstatindex('myschema.test_pkey'::regclass::oid);
 version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation
---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
       2 |          0 |       8192 |             1 |              0 |          1 |           0 |             0 |             0.79 |                  0
(1 row)

Having both interfaces for a while would allow users enough time to rewrite their applications. Then, we will be able to obsolete (or just drop) the old interfaces in a future release, maybe 9.4 or 9.5. I think this approach would minimize the impact of such an interface change. So, I think we can clean up the function arguments in the pgstattuple module, and also we can have two interfaces, both regclass and text, for the next release. Any comments? In the document, you should mark old functions as deprecated.

I'm still considering changing the function name as Tom pointed out. How about pgstatbtindex? In fact, pgstatindex supports only BTree indexes, so pgstatbtindex seems more appropriate for this function. We can keep both the (old) pgstatindex and the (new) pgstatbtindex during the next 2-3 major releases, and the old one can be deprecated after that. Any comments? Regards, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC.
http://www.uptime.jp -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Add visibility map information to pg_freespace.
I'm looking into this patch as a reviewer.

(2013/06/19 18:03), Simon Riggs wrote: On 19 June 2013 09:19, Kyotaro HORIGUCHI horiguchi.kyot...@lab.ntt.co.jp wrote: It should be useful in other aspects, but it seems a bit complicated just to know about visibility bits for certain blocks. With your current patch you can only see the visibility info for blocks in cache, not for all blocks. So while you may think it is useful, it is also unnecessarily limited in scope. Let's just have something that is easy to use that lets us see the visibility state for a block, not just blocks in freespace.

I think we could also have these visibility map statistics in the pgstattuple function (as a new column) for this purpose.

IMHO, we have several modules for different purposes:
- pageinspect provides several functions for debugging purposes.
- pg_freespace provides a view for monitoring purposes.
- pgstattuple provides several functions for collecting specific table/index statistics.

So, we can have similar features in different modules. Any comments? Regards, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] By now, why PostgreSQL 9.2 don't support SSDs?
2013/03/30 23:31, Bruce Momjian wrote: On Sat, Mar 30, 2013 at 10:08:44PM +0800, 赖文豫 wrote: As we know, SSDs are widely used in various kinds of applications. But the SMGR in PostgreSQL still only supports magnetic disk. How do we make full use of SSDs to improve the performance of PostgreSQL? When the storage manager (SMGR) says magnetic disk, it is talking about read/write media with random access capability, vs. something like write-once media, which was originally supported in the code. Postgres works just fine with SSDs; the only adjustment you might want to make is to reduce random_page_cost.

BTW, would using a larger block size (64kB) improve performance when using an SSD drive?

I found that the configure script supports the --with-blocksize option to change the block size up to 32kB. (The configure script does not support a 64kB block size so far.) But I heard that a larger block size, like 256kB, would take advantage of SSD performance because of the block management within the SSD. So, I'm just curious to know about that.

Regards, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fix pgstattuple/pgstatindex to use regclass-type as the argument
(2013/03/05 22:46), Robert Haas wrote: On Sun, Mar 3, 2013 at 5:32 PM, Tom Lane t...@sss.pgh.pa.us wrote: Maybe this is acceptable collateral damage. I don't know. But we definitely stand a chance of breaking applications if we change pgstatindex like this. It might be better to invent a differently-named function to replace pgstatindex. If this were a built-in function, we might have to make a painful decision between breaking backward compatibility and leaving this broken forever, but as it isn't, we don't. I think your suggestion of adding a new function is exactly right. We can remove the old one in a future release, and support both in the meantime. It strikes me that if extension versioning is for anything, this is it.

It is obviously easy to keep two types of function interfaces, one with regclass-type and another with text-type, in the next release for backward-compatibility like below:

pgstattuple(regclass) -- safer interface.
pgstattuple(text) -- will be deprecated in a future release.

Having both interfaces for a while would allow users enough time to rewrite their applications. Then, we will be able to obsolete (or just drop) the old interfaces in a future release, maybe 9.4 or 9.5. I think this approach would minimize the impact of such an interface change. So, I think we can clean up the function arguments in the pgstattuple module, and also we can have two interfaces, both regclass and text, for the next release. Any comments? Regards, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Fix pgstattuple/pgstatindex to use regclass-type as the argument
Hi, As I wrote before, I'd like to clean up pgstattuple functions to allow the same expressions. Re: [HACKERS] [RFC] pgstattuple/pgstatindex enhancement http://www.postgresql.org/message-id/511ee19b.5010...@uptime.jp My goal is to allow specifying a relation/index with several expressions, 'relname', 'schemaname.relname' and oid in all pgstattuple functions. pgstatindex() does not accept oid so far. I have found that the backward-compatibility can be kept when the arguments (text and/or oid) are replaced with regclass type. regclass type seems to be more appropriate here. So, I cleaned up those three functions, pgstattuple(), pgstatindex(), pg_relpages(), to accept a regclass-type argument, instead of using text and/or oid type, as the test cases show. select * from pgstatindex('test_pkey'); select * from pgstatindex('public.test_pkey'); select * from pgstatindex('myschema.test_pkey'); select * from pgstatindex('myschema.test_pkey'::regclass::oid); With attached patch, all functions in the pgstattuple module can accept the same expression to specify the target relation/index. Any comments or suggestions? Regards, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp pgstattuple_regclass_v1.diff.gz Description: application/gzip -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [RFC] pgstattuple/pgstatindex enhancement
(2013/02/15 1:55), Robert Haas wrote: On Tue, Feb 12, 2013 at 10:22 AM, Satoshi Nagayasu sn...@uptime.jp wrote: (1) Fix pgstatindex arguments to work same as pgstattuple. As the document describes, pgstattuple accepts 'schema.table' expression and oid of the table, but pgstatindex doesn't. (because I didn't add that when I created pgstatindex...) http://www.postgresql.org/docs/devel/static/pgstattuple.html So, I'd like to change pgstatindex arguments to allow schema name and oid. Does it make sense? Not sure. It seems nice, but it's also a backward-compatibility break. So I don't know. Yeah, actually, the backward-compatibility issue is the first thing I have considered, and now I think we can keep it. Now, pgstattuple() function accepts following syntax: pgstattuple('table') -- table name (searches in search_path) pgstattuple('schema.table') -- schema and table name pgstattuple(1234) -- oid and pgstatindex() function only accepts below so far: pgstatindex('index') -- index name (searches in search_path) Then, we can easily add new syntax: pgstatindex('schema.index') -- schema and index name pgstatindex(1234) -- oid I think this would allow us to modify pgstatindex() without breaking the backward-compatibility. (2) Enhance pgstattuple/pgstatindex to allow block sampling. Now, we have large tables and indexes in PostgreSQL, and these are growing day by day. pgstattuple and pgstatindex are both very important to keep database performance well, but doing full-scans on large tables and indexes would generate big performance impact. So, now I think pgstattuple and pgstatindex should support 'block sampling' to collect block statistics with avoiding full-scans. With this block sampling feature, pgstattuple/pgstatindex would be able to collect block statistics from 1~10% of the blocks in the table/index if the table/index is large (maybe 10GB or more). Now that sounds really nice. Thanks. I will try it. 
Regards, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] [RFC] pgstattuple/pgstatindex enhancement
Hi, I'm thinking of a pgstattuple/pgstatindex enhancement. There are a few things I need to change, but I'd like to have some comments and suggestions from hackers before tackling them.

(1) Fix pgstatindex arguments to work the same as pgstattuple. As the document describes, pgstattuple accepts the 'schema.table' expression and the oid of the table, but pgstatindex doesn't (because I didn't add that when I created pgstatindex...). http://www.postgresql.org/docs/devel/static/pgstattuple.html So, I'd like to change the pgstatindex arguments to allow a schema name and an oid. Does it make sense?

(2) Enhance pgstattuple/pgstatindex to allow block sampling. Now, we have large tables and indexes in PostgreSQL, and these are growing day by day. pgstattuple and pgstatindex are both very important for keeping database performance healthy, but doing full scans on large tables and indexes would generate a big performance impact. So, I now think pgstattuple and pgstatindex should support 'block sampling' to collect block statistics while avoiding full scans. With this block sampling feature, pgstattuple/pgstatindex would be able to collect block statistics from 1~10% of the blocks in the table/index if the table/index is large (maybe 10GB or more). It would allow us to run pgstattuple/pgstatindex more easily. Is it worth having? Any comments? Regards, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
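[Editorial note] The block-sampling idea above can be sketched as follows. This is an illustrative Python sketch, not the proposed implementation (which would live in C inside the pgstattuple module); the function names and the free-space example are invented for illustration.

```python
# Sketch: estimate table-level statistics from a random subset of blocks
# instead of a full scan, then scale the result up to the whole table.
import random

def sample_block_numbers(total_blocks: int, fraction: float, seed: int = 0) -> list[int]:
    """Pick roughly `fraction` of the block numbers, uniformly at random."""
    k = max(1, int(total_blocks * fraction))
    rng = random.Random(seed)
    return sorted(rng.sample(range(total_blocks), k))

def estimate_free_space(per_block_free: dict[int, int], total_blocks: int,
                        fraction: float) -> float:
    """Scale the free space seen in the sampled blocks up to the table size."""
    sampled = sample_block_numbers(total_blocks, fraction)
    seen = sum(per_block_free[b] for b in sampled)
    return seen * (total_blocks / len(sampled))

# Pretend every block of a 10,000-block table has 100 bytes free; a 1%
# sample then extrapolates to the exact total of 1,000,000 bytes.
blocks = {b: 100 for b in range(10_000)}
print(round(estimate_free_space(blocks, 10_000, 0.01)))
```

The trade-off discussed in the thread is visible here: the estimate is exact for uniform tables but only approximate for skewed ones, which is why sampling 1~10% of blocks is proposed only for large relations.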
Re: [HACKERS] psql \l to accept patterns
Hi, I have tried this patch. https://commitfest.postgresql.org/action/patch_view?id=1051

2013/01/29 14:48, Peter Eisentraut wrote: On Mon, 2013-01-07 at 07:14 -0500, Peter Eisentraut wrote: Here is a patch for psql's \l command to accept patterns, like \d commands do. While at it, I also added an S option to show system objects and removed system objects from the default display. This might be a bit controversial, but it's how it was decided some time ago that the \d commands should act. Most people didn't like the S option, so here is a revised patch that just adds the pattern support.

It seems to work well with the latest git master, and I think it's good enough to be committed. BTW, is there any good place to put a new regression test for the psql command? I couldn't find one. Any comment or suggestion? Regards, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] psql \l to accept patterns
(2013/01/30 0:34), Tom Lane wrote: Satoshi Nagayasu sn...@uptime.jp writes: On Mon, 2013-01-07 at 07:14 -0500, Peter Eisentraut wrote: Here is a patch for psql's \l command to accept patterns, like \d BTW, is there any good place to put new regression test for the psql command? I couldn't find it out. As far as a test for this specific feature goes, I'd be against adding one, because it'd be likely to result in random failures in make installcheck mode, depending on what other databases are in the installation. More generally, we've tended to put tests of \d-style psql features together with the relevant backend-side tests.

Yes, I think so too. First of all, I was looking for some regression tests for the CREATE/ALTER/DROP DATABASE commands, but I couldn't find them in the test/regress/sql/ directory. So, I asked the question. I guess these database tests are in pg_regress.c. Right?

The proposed patch to add \gset adds a separate regression test file specifically for psql. I've been debating whether that was worth committing; but if there's near-term interest in adding any more tests for psql features that aren't closely connected to backend features, maybe it's worth having such a file.

Personally, I'm interested in having regression tests whatever the target is, because software tends to become more complicated. So, if we reach consensus on having dedicated tests for the psql command (or other client-side commands), I'd like to contribute to that. Regards, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PERFORM] pgbench to the MAXINT
Hi, I have reviewed this patch. https://commitfest.postgresql.org/action/patch_view?id=1068

2012/12/21 Gurjeet Singh singh.gurj...@gmail.com: The patch is very much what you had posted, except for a couple of differences due to bit-rot. (i) I didn't have to #define MAX_RANDOM_VALUE64 since its cousin MAX_RANDOM_VALUE is not used by code anymore, and (ii) I used a ternary operator in the DDLs[] array to decide when to use bigint vs int columns. Please review. As for tests, I am currently running 'pgbench -i -s 21474' using unpatched pgbench, and am recording the time taken; scale factor 21475 had actually failed to do anything meaningful using unpatched pgbench. Next I'll run with '-s 21475' on the patched version to see if it does the right thing, and in acceptable time compared to '-s 21474'. What tests would you and others like to see, to get some confidence in the patch? The machine that I have access to has 62 GB RAM, 16-core 64-hw-threads, and about 900 GB of disk space.

I have tested this patch, and have confirmed that the columns for aid are switched to bigint, instead of int, when the scale factor is >= 20,000 (aid values would exceed the upper bound of int when sf > 21474).

Also, I added a few fixes on it:
- Fixed to apply to the current git master.
- Fixed to suppress a few more warnings about INT64_FORMAT.
- Minor improvement in the docs (just my suggestion).

I attached the revised one. Regards, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC http://www.uptime.jp/ pgbench-64-v7.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
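[Editorial note] The arithmetic behind the scale-factor cutover above can be checked with a short sketch. This is an illustrative Python sketch, not part of the patch; the only fact taken from pgbench is its standard scaling rule of 100,000 pgbench_accounts rows per scale-factor unit, and the `needs_bigint` helper is invented here.

```python
# Sketch: why pgbench needs bigint aid columns at large scale factors.
# pgbench creates scale * 100,000 rows in pgbench_accounts, and aid
# overflows a signed 32-bit int once that product exceeds 2^31 - 1.
ACCOUNTS_PER_SCALE = 100_000
INT32_MAX = 2**31 - 1

def needs_bigint(scale_factor: int) -> bool:
    """True if the largest aid value would overflow a 32-bit int."""
    return scale_factor * ACCOUNTS_PER_SCALE > INT32_MAX

# sf = 21474 still fits (2,147,400,000 <= 2,147,483,647), while
# sf = 21475 overflows -- matching the report that scale factor 21475
# "failed to do anything meaningful" with unpatched pgbench.
print(needs_bigint(21474))  # False
print(needs_bigint(21475))  # True
```

The patch's reviewed behavior of switching DDL to bigint from scale factor 20,000 upward is simply a round, conservative threshold below the exact overflow point.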
Re: [HACKERS] buffer assertion tripping under repeat pgbench load
Hi, I just reviewed this patch. https://commitfest.postgresql.org/action/patch_view?id=1035

2013/1/13 Greg Smith g...@2ndquadrant.com: On 12/26/12 7:23 PM, Greg Stark wrote: It's also possible it's a bad cpu, not bad memory. If it affects decrement or increment in particular it's possible that the pattern of usage on LocalRefCount is particularly prone to triggering it. This looks to be the winning answer. It turns out that under extended multi-hour loads at high concurrency, something related to CPU overheating was occasionally flipping a bit. One round of compressed air for all the fans/vents, a little tweaking of the fan controls, and now the system goes 24 hours with no problems. Sorry about all the noise over this. I do think the improved warning messages that came out of the diagnosis ideas are useful. The reworked code must slow down the checking a few cycles, but if you care about performance these assertions are tacked onto the biggest pig around. I added the patch to the January CF as Improve buffer refcount leak warning messages. The sample I showed with the patch submission was a simulated one. Here's the output from the last crash before resolving the issue, where the assertion really triggered:

WARNING: buffer refcount leak: [170583] (rel=base/16384/16578, blockNum=302295, flags=0x106, refcount=0 1073741824)
WARNING: buffers with non-zero refcount is 1
TRAP: FailedAssertion(!(RefCountErrors == 0), File: bufmgr.c, Line: 1712)

This patch is intended to improve the warning message at AtEOXact_Buffers(), but I guess another function, AtProcExit_Buffers(), needs to be modified as well. Right? With this additional fix, the patch could be applied to the current git master, and could be compiled with the --enable-cassert option. Then, I need some suggestions from hackers to continue this review. How should I reproduce this message for review? This is a debug warning message, so it's not easy for me to reproduce it. Any suggestion?
-- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC http://www.uptime.jp/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] New statistics for WAL buffer dirty writes
(2012/11/27 7:42), Alvaro Herrera wrote: Satoshi Nagayasu escribió: I attached the latest one, which splits the reset_time for bgwriter and walwriter, and provides a new system view, called pg_stat_walwriter, to show the dirty write counter and the reset time. Thanks. I gave this a look and I have a couple of comments:

1. The counter is called dirty_write. I imagine that this comes directly from the name of the nearby DTrace probe, TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_START. That probe comes from email 494c1565.3060...@sun.com committed in 4ee79fd20d9a. But there wasn't much discussion about the name back then. Maybe that was fine at the time because it was pretty much an internal matter, being so deep in the code. But we're now going to expose it to regular users, so we'd better choose a very good name, because we're going to be stuck with it for a very long time. And the name dirty doesn't ring with me too well; what matters here is not that we're writing a buffer that is dirty, but the fact that we're writing while holding the WalInsertLock, so the name should convey the fact that this is a locked write or something like that. Or at least that's how I see the issue. Note the documentation wording:

+     <entry><structfield>dirty_writes</></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of dirty writes, which means flushing wal buffers
+      because of its full.</entry>

Yes, dirty_writes came from the probe name, and if it needs to be changed, buffers_flush would make sense to me in this situation, because this counter is intended to show WAL writes due to the wal buffers being full.

2. Should we put bgwriter and walwriter data in separate structs? I don't think this is necessary, but maybe someone opines differently? I tried to minimize the impact of this patch, but if I can change this struct, yes, I'd like to split it into two structs.

3.
+/*
+ * WalWriter global statistics counter.
+ * Despite its name, this counter is actually used not only in walwriter, + * but also in each backend process to sum up xlog dirty writes. + * Those processes would increment this counter in each XLogWrite call, + * then send it to the stat collector process. + */ +PgStat_MsgWalWriter WalWriterStats; Maybe we should use a different name for the struct, to avoid having to excuse ourselves for the name being wrong ... Ok. How about WalBufferStats? I think this name could be accepted in both the wal writer and each backend process. -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] New statistics for WAL buffer dirty writes
(2012/12/10 3:06), Tomas Vondra wrote: On 29.10.2012 04:58, Satoshi Nagayasu wrote: 2012/10/24 1:12, Alvaro Herrera wrote: Satoshi Nagayasu escribió: With this patch, the walwriter process and each backend process would sum up dirty writes, and send them to the stat collector. So, the value could be saved in the stat file, and could be kept on restarting. The statistics could be retrieved using the pg_stat_get_xlog_dirty_writes() function, and could be reset by calling pg_stat_reset_shared('walwriter'). Now, I have one concern. The reset time could be captured in globalStats.stat_reset_timestamp, but this value is the same as the bgwriter one. So, once pg_stat_reset_shared('walwriter') is called, the stats_reset column in pg_stat_bgwriter represents the reset time for walwriter, not for bgwriter. How should we handle this? Should we split this value? And should we have a new system view for walwriter? I think the answer to the two last questions is yes. It doesn't seem to make sense, to me, to have a single reset timing for what are effectively two separate things. Please submit an updated patch to the next CF. I'm marking this one returned with feedback. Thanks. I attached the latest one, which splits the reset_time for bgwriter and walwriter, and provides a new system view, called pg_stat_walwriter, to show the dirty write counter and the reset time.

I've done a quick review of the v4 patch: Thanks for the review, and sorry for my delayed response.

1) applies fine on HEAD, compiles fine

2) make installcheck fails because of a difference in the 'rules' test suite (there's a new view pg_stat_walwriter - see the attached patch for a fixed version of expected/rules.out) Ah, I forgot about the regression test. I will fix it. Thanks.

3) I do agree with Alvaro that using the same struct for two separate components (bgwriter and walwriter) seems a bit awkward.
For example you need to have two separate stat_reset fields, the reset code becomes much more verbose (because you need to list individual fields) etc. So I'd vote to either split this into two structures or keeping it as a single structure (although with two views on top of it). Ok, I will split it into two structs, PgStat_BgWriterGlobalStats and PgStat_WalWriterGlobalStats, and will modify PgStat_GlobalStats to hold those two structs in the stat collector. 4) Are there any other fields that might be interesting? Right now there's just dirty_writes but I guess there are other values. E.g. how much data was actually written etc.? AFAIK, I think those numbers can be obtained by calling pg_current_xlog_insert_location() or pg_current_xlog_location(), but if we need it, I will add it. Regards, Tomas -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
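[Editorial note] The struct split agreed above (separate bgwriter and walwriter stats, each with its own stats_reset) can be illustrated with a small sketch. This is an editorial Python model, not the C patch; the field and method names are invented, and only the design point — resetting one component's stats leaves the other untouched — is taken from the thread.

```python
# Sketch: per-component stats records with independent reset timestamps,
# so pg_stat_reset_shared('walwriter') no longer clobbers the bgwriter's
# stats_reset (the problem described in the thread).
from dataclasses import dataclass, field

@dataclass
class BgWriterStats:
    buffers_clean: int = 0
    stats_reset: int = 0   # stand-in for a reset timestamp

@dataclass
class WalWriterStats:
    dirty_writes: int = 0
    stats_reset: int = 0

@dataclass
class GlobalStats:
    bgwriter: BgWriterStats = field(default_factory=BgWriterStats)
    walwriter: WalWriterStats = field(default_factory=WalWriterStats)

    def reset_shared(self, target: str, now: int) -> None:
        # Reset only the named component, recording its own reset time.
        if target == "bgwriter":
            self.bgwriter = BgWriterStats(stats_reset=now)
        elif target == "walwriter":
            self.walwriter = WalWriterStats(stats_reset=now)

g = GlobalStats()
g.walwriter.dirty_writes = 42
g.reset_shared("walwriter", now=100)
print(g.walwriter.dirty_writes, g.walwriter.stats_reset, g.bgwriter.stats_reset)
# 0 100 0
```

With a single shared struct, the last line would have shown the bgwriter's stats_reset changing too, which is exactly the confusion the split avoids.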
Re: [HACKERS] RFC: Timing Events
(2012/11/03 10:44), Josh Berkus wrote: I don't see all that going into core without a much bigger push than I think people will buy. What people really want for all these is a proper trending system, and that means graphs and dashboards and bling--not a history table. Well, I'm particularly thinking of autoconfiguration. For example, to set vacuum_freeze_min_age properly, you have to know the XID burn rate of the server, which is only available via history. I really don't want to be depending on a graphical monitoring utility to find these things out. This whole approach has the assumption that things are going to fall off sometimes. To expand on that theme for a second, right now I'm more worried about the 99% class of problems. Neither pg_stat_statements nor this idea is very good for tracking the rare rogue problem down. They're both aimed to make things that happen a lot more statistically likely to be seen, by giving an easier UI to glare at them frequently. That's not ideal, but I suspect really fleshing out the whole queue consumer - table idea needs to happen to do much better. I'm just concerned that for some types of incidents, it would be much more than 1% *of what you want to look at* which falls off. For example, consider a server which does 95% reads at a very high rate, but has 2% of its writes chronically having lock waits. That's something you want to solve, but it seems fairly probable that these relatively infrequent queries would have fallen off the bottom of pg_stat_statements. Same thing with the relative handful of queries which do large on-disk sorts. The problem I'm worried about is that pg_stat_statements is designed to keep the most frequent queries, but sometimes the thing you really need to look at is not in the list of most frequent queries.

I think auto_explain would help you solve such rare incidents if it could dump several statistics into the server log, including lock waits and block read/write statistics per session, for example.
Do we have something to add to auto_explain? Regards, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] New statistics for WAL buffer dirty writes
2012/10/24 1:12, Alvaro Herrera wrote: Satoshi Nagayasu escribió: With this patch, the walwriter process and each backend process would sum up dirty writes, and send them to the stat collector. So, the value could be saved in the stat file, and could be kept on restarting. The statistics could be retrieved using the pg_stat_get_xlog_dirty_writes() function, and could be reset by calling pg_stat_reset_shared('walwriter'). Now, I have one concern. The reset time could be captured in globalStats.stat_reset_timestamp, but this value is the same as the bgwriter one. So, once pg_stat_reset_shared('walwriter') is called, the stats_reset column in pg_stat_bgwriter represents the reset time for walwriter, not for bgwriter. How should we handle this? Should we split this value? And should we have a new system view for walwriter? I think the answer to the two last questions is yes. It doesn't seem to make sense, to me, to have a single reset timing for what are effectively two separate things. Please submit an updated patch to the next CF. I'm marking this one returned with feedback.

Thanks. I attached the latest one, which splits the reset_time for bgwriter and walwriter, and provides a new system view, called pg_stat_walwriter, to show the dirty write counter and the reset time.

Regards, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index b4fcbaf..0ae885b 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1907,6 +1907,13 @@ include 'filename'
         results in most cases.
        </para>
 
+       <para>
+        When you see pg_stat_walwriter.dirty_write, which means number
+        of buffer flushing at buffer full, is continuously increasing
+        in your running server, you may need to enlarge this buffer
+        size.
+       </para>
+
      </listitem>
     </varlistentry>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 39ccfbb..3117f91 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -278,6 +278,14 @@ postgres: <replaceable>user</> <replaceable>database</> <replaceable>host</> <re
      </row>
 
      <row>
+      <entry><structname>pg_stat_walwriter</><indexterm><primary>pg_stat_walwriter</primary></indexterm></entry>
+      <entry>One row only, showing statistics about the wal writer
+       process's activity. See <xref linkend="pg-stat-walwriter-view">
+       for details.
+      </entry>
+     </row>
+
+     <row>
       <entry><structname>pg_stat_database</><indexterm><primary>pg_stat_database</primary></indexterm></entry>
       <entry>One row per database, showing database-wide statistics. See
        <xref linkend="pg-stat-database-view"> for details.
@@ -735,6 +743,39 @@ postgres: <replaceable>user</> <replaceable>database</> <replaceable>host</> <re
    single row, containing global data for the cluster.
   </para>
 
+  <table id="pg-stat-walwriter-view" xreflabel="pg_stat_walwriter">
+   <title><structname>pg_stat_walwriter</structname> View</title>
+
+   <tgroup cols="3">
+    <thead>
+     <row>
+      <entry>Column</entry>
+      <entry>Type</entry>
+      <entry>Description</entry>
+     </row>
+    </thead>
+
+    <tbody>
+     <row>
+      <entry><structfield>dirty_writes</></entry>
+      <entry><type>bigint</type></entry>
+      <entry>Number of dirty writes, which means flushing wal buffers
+       because of its full.</entry>
+     </row>
+     <row>
+      <entry><structfield>stats_reset</></entry>
+      <entry><type>timestamp with time zone</type></entry>
+      <entry>Time at which these statistics were last reset</entry>
+     </row>
+    </tbody>
+   </tgroup>
+  </table>
+
+  <para>
+   The <structname>pg_stat_walwriter</structname> view will always have a
+   single row, containing global data for the cluster.
+  </para>
+
   <table id="pg-stat-database-view" xreflabel="pg_stat_database">
    <title><structname>pg_stat_database</structname> View</title>
    <tgroup cols="3">
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index d251d08..631a0af 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1347,6 +1347,7 @@ AdvanceXLInsertBuffer(bool new_segment)
 			WriteRqst.Write = OldPageRqstPtr;
 			WriteRqst.Flush = 0;
 			XLogWrite(WriteRqst, false, false);
+			WalWriterStats.m_xlog_dirty_writes++;
 			LWLockRelease(WALWriteLock);
 			TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
 		}
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 607a72f..40f0c34 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -671,6 +671,11 @@ CREATE VIEW pg_stat_bgwriter AS
         pg_stat_get_buf_alloc() AS buffers_alloc,
         pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
 
+CREATE VIEW pg_stat_walwriter AS
+    SELECT
Re: [HACKERS] pg_stat_lwlocks view - lwlocks statistics, round 2
2012/10/19 4:36, Robert Haas wrote: On Tue, Oct 16, 2012 at 11:31 AM, Satoshi Nagayasu sn...@uptime.jp wrote: A flight recorder must not be disabled. Collecting performance data must be a top priority for a DBA. This analogy is inapposite, though, because a flight recorder rarely crashes the aircraft. If it did, people might have second thoughts about the "never disable the flight recorder" rule. I have had a couple of different excuses to look into the overhead of timing lately, and it does indeed seem that on many modern Linux boxes even extremely frequent gettimeofday calls produce only very modest amounts of overhead. I agree that such a performance instrument needs to be improved if it has a critical performance impact on production environments. So, I'm still looking for a better implementation to decrease the performance impact. However, the most important question here is: how can we understand PostgreSQL behavior without reading tons of source code and having hacking skills? Recently, I've been having lots of conversations with database specialists (sales guys and DBAs) trying to use PostgreSQL instead of a commercial database, and they are always struggling to understand PostgreSQL behavior, because no one can observe and/or explain it. Sadly, the situation on Windows doesn't look so good. I don't remember the exact numbers but I think it was something like 40 or 60 or 80 times slower on the Windows box one of my colleagues tested than it is on Linux. And it turns out that that overhead really is measurable and does matter if you do it in a code path that gets run frequently. Of course I am enough of a Linux geek that I don't use Windows myself and curse my fate when I do have to use it, but the reality is that we have a huge base of users who only use PostgreSQL at all because it runs on Windows, and we can't just throw those people under the bus.
I think that older platforms like HP/UX likely have problems in this area as well although I confess to not having tested. Do you mean my stat patch should have more performance tests on other platforms? Yes, I agree with that. That having been said, if we're going to do this, this is probably the right approach, because it only calls gettimeofday() in the case where the lock acquisition is contended, and that is a lot cheaper than calling it in all cases. Maybe it's worth finding a platform where pg_test_timing reports that timing is very slow and then measuring how much impact this has on something like a pgbench or pgbench -S workload. We might find that it is in fact negligible. I'm pretty certain that it will be almost if not entirely negligible on Linux but that's not really the case we need to worry about. Thanks for the suggestion of a better implementation. As I mentioned above, I'm always looking for a better idea and solution to meet our purpose. Here, I want to share my concern with you again. PostgreSQL is getting more complicated in order to improve performance and stability, and I think that may be good news. But it is also getting more difficult to understand without deep knowledge nowadays, and that is actually bad news. From my point of view, that's a huge hurdle to educating DBAs and expanding the PostgreSQL user base. Regards, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp
Re: [HACKERS] pg_stat_lwlocks view - lwlocks statistics, round 2
2012/10/19 23:48, Fujii Masao wrote: On Wed, Oct 17, 2012 at 12:31 AM, Satoshi Nagayasu sn...@uptime.jp wrote: 2012/10/16 2:40, Jeff Janes wrote: On Sun, Oct 14, 2012 at 9:43 AM, Tom Lane t...@sss.pgh.pa.us wrote: Satoshi Nagayasu sn...@uptime.jp writes: (2012/10/14 13:26), Fujii Masao wrote: The tracing lwlock usage seems to still cause a small performance overhead even if reporting is disabled. I believe some users would prefer to avoid such overhead even if pg_stat_lwlocks is not available. It should be up to a user to decide whether to trace lwlock usage, e.g., by using a trace_lwlock parameter, I think. Frankly speaking, I do not agree with disabling a performance instrument to improve performance. A DBA must *always* monitor the performance metrics under such a heavy workload. This brings up a question that I don't think has been honestly considered, which is exactly whom a feature like this is targeted at. TBH I think it's of about zero use to DBAs (making the above argument bogus). It is potentially of use to developers, but a DBA is unlikely to be able to do anything about lwlock-level contention even if he has the knowledge to interpret the data. Waiting on BufFreelistLock suggests increasing shared_buffers. Waiting on ProcArrayLock perhaps suggests use of a connection pooler (or does it?) WALWriteLock suggests doing something about IO, either moving logs to different disks, or getting BBU, or something. WALInsertLock suggests trying to adapt your data loading process so it can take advantage of the bulk, or maybe increasing wal_buffers. And a lot of waiting on any of the locks gives a piece of information the DBA can use when asking the mailing lists for help, even if it doesn't allow him to take unilateral action. So I feel it isn't something that should be turned on in production builds. I'd vote for enabling it by a non-default configure option, and making sure that it doesn't introduce any overhead when the option is off.
I think hackers would benefit from getting reports from DBAs in the field with concrete data on bottlenecks. If the only way to get this is to do some non-standard compile and deploy it to production, or to create a benchmarking copy of the production database system including a realistic work-load driver and run the non-standard compile there; either of those is going to dramatically cut down on the participation. Agreed. The hardest part of investigating a performance issue is reproducing the situation in an environment different from production. I often see people struggling to reproduce a situation with different hardware and (similar but) different workload. It is very time consuming, and it often fails, too. So, we need to collect any piece of information which would help us understand what's going on within the production PostgreSQL, without any changes to binaries and configurations in the production environment. That's the reason why I stick to a built-in instrument, and I disagree with disabling such an instrument even if it has a minor performance overhead. A flight recorder must not be disabled. Collecting performance data must be a top priority for a DBA. pg_stat_lwlocks does not seem to be an adequate 'flight recorder'. It collects only narrow performance data concerning lwlocks. What we should have as a 'flight recorder' is something like Oracle's wait events, I think. Not only lwlocks but all wait events should be collected for DBAs to investigate performance bottlenecks. That's the reason why I said I accept that it's not enough for DBAs, and I'm going to work on other lock stats. This idea was proposed by Itagaki-san before. Though he implemented the sampling-profiler patch, it failed to be committed. I'm not sure why not. Yeah, I know the previous patch posted by Itagaki-san. So, I'm questioning why (again) this time. I think this is a very important question because it would be critical in order to attract new DBAs to PostgreSQL.
Anyway, I think that this would be the right approach to providing a 'flight recorder' to DBAs. Ok, I guess we have reached a consensus on having some kind of flight recorder. Right? Actually, that seems like great progress from my point of view. :) Regards, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp
Re: [HACKERS] pg_stat_lwlocks view - lwlocks statistics, round 2
2012/10/20 2:45, Tom Lane wrote: Satoshi Nagayasu sn...@uptime.jp writes: Ok, I guess we have reached a consensus on having some kind of flight recorder. Right? I remain unconvinced. I think the argument that this will promote novice understanding is complete hogwash: making any sense of lwlock-level stats will require deep PG understanding, not create it. And despite Jeff's optimistic views upthread, I don't think very many people have or will acquire such understanding. So I'm really resistant to accepting even minimal overhead in pursuit of this goal --- and I do not believe the overhead will be minimal, either. As I wrote earlier, I'm actually not pushing the lwlock stats themselves in this context, though I still believe they would be useful. (I don't think they would be hogwash, because I needed them.) I'm just thinking about how to help DBAs understand PostgreSQL behavior without a deep dive into the code when they face some kind of performance issue, and also how to make it easier for PG experts, including tech support people, to help newbies. As I posted before, I did an investigation on WAL performance when I faced random commit delays, and I found a problem with the storage device by observing WALInsertLock and WALWriteLock statistics using SystemTap. How could I do that without SystemTap on the production database? SystemTap would kill performance (more than my patch), and sometimes kills the process, too. Do you really think that we do not need any kind of lock statistics at all? Regards, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp
Re: [HACKERS] pg_stat_lwlocks view - lwlocks statistics, round 2
2012/10/16 2:40, Jeff Janes wrote: On Sun, Oct 14, 2012 at 9:43 AM, Tom Lane t...@sss.pgh.pa.us wrote: Satoshi Nagayasu sn...@uptime.jp writes: (2012/10/14 13:26), Fujii Masao wrote: The tracing lwlock usage seems to still cause a small performance overhead even if reporting is disabled. I believe some users would prefer to avoid such overhead even if pg_stat_lwlocks is not available. It should be up to a user to decide whether to trace lwlock usage, e.g., by using a trace_lwlock parameter, I think. Frankly speaking, I do not agree with disabling a performance instrument to improve performance. A DBA must *always* monitor the performance metrics under such a heavy workload. This brings up a question that I don't think has been honestly considered, which is exactly whom a feature like this is targeted at. TBH I think it's of about zero use to DBAs (making the above argument bogus). It is potentially of use to developers, but a DBA is unlikely to be able to do anything about lwlock-level contention even if he has the knowledge to interpret the data. Waiting on BufFreelistLock suggests increasing shared_buffers. Waiting on ProcArrayLock perhaps suggests use of a connection pooler (or does it?) WALWriteLock suggests doing something about IO, either moving logs to different disks, or getting BBU, or something. WALInsertLock suggests trying to adapt your data loading process so it can take advantage of the bulk, or maybe increasing wal_buffers. And a lot of waiting on any of the locks gives a piece of information the DBA can use when asking the mailing lists for help, even if it doesn't allow him to take unilateral action. So I feel it isn't something that should be turned on in production builds. I'd vote for enabling it by a non-default configure option, and making sure that it doesn't introduce any overhead when the option is off. I think hackers would benefit from getting reports from DBAs in the field with concrete data on bottlenecks.
If the only way to get this is to do some non-standard compile and deploy it to production, or to create a benchmarking copy of the production database system including a realistic work-load driver and run the non-standard compile there; either of those is going to dramatically cut down on the participation. Agreed. The hardest part of investigating a performance issue is reproducing the situation in an environment different from production. I often see people struggling to reproduce a situation with different hardware and (similar but) different workload. It is very time consuming, and it often fails, too. So, we need to collect any piece of information which would help us understand what's going on within the production PostgreSQL, without any changes to binaries and configurations in the production environment. That's the reason why I stick to a built-in instrument, and I disagree with disabling such an instrument even if it has a minor performance overhead. A flight recorder must not be disabled. Collecting performance data must be a top priority for a DBA. Regards, Cheers, Jeff -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp
Re: [HACKERS] pg_stat_lwlocks view - lwlocks statistics, round 2
2012/10/15 1:43, Tom Lane wrote: Satoshi Nagayasu sn...@uptime.jp writes: (2012/10/14 13:26), Fujii Masao wrote: The tracing lwlock usage seems to still cause a small performance overhead even if reporting is disabled. I believe some users would prefer to avoid such overhead even if pg_stat_lwlocks is not available. It should be up to a user to decide whether to trace lwlock usage, e.g., by using a trace_lwlock parameter, I think. Frankly speaking, I do not agree with disabling a performance instrument to improve performance. A DBA must *always* monitor the performance metrics under such a heavy workload. This brings up a question that I don't think has been honestly considered, which is exactly whom a feature like this is targeted at. TBH I think it's of about zero use to DBAs (making the above argument bogus). It is potentially of use to developers, but a DBA is unlikely to be able to do anything about lwlock-level contention even if he has the knowledge to interpret the data. Actually, I'm not a developer. I'm just a DBA, and I needed such an instrument when I was asked to investigate strange WAL behavior that produced unexpected/random commit delays under heavy workload. And another patch (the WAL dirty flush statistics patch) I have submitted comes from the same experience. https://commitfest.postgresql.org/action/patch_view?id=893 Unfortunately, since I didn't have such an instrument at that time, I used SystemTap to investigate WAL behaviors, including call counts and waited time, but using SystemTap was really awful, and I thought PostgreSQL needs to have some built-in instrument for DBAs. I needed to determine the bottleneck around WAL, such as lock contention and/or write performance of the device, but I couldn't find anything without an instrument. I accept that I'm focusing only on WAL-related lwlocks, and that it is not enough for ordinary DBAs, but I still need it to understand PostgreSQL behavior without having deep knowledge of and experience with the WAL design and implementation.
So I feel it isn't something that should be turned on in production builds. I'd vote for enabling it by a non-default configure option, and making sure that it doesn't introduce any overhead when the option is off. There is another option to eliminate the performance overhead for this purpose. As I tried in the first patch, instead of reporting through the pgstat collector process, each backend could directly increment lwlock counters allocated in shared memory. http://archives.postgresql.org/message-id/4fe9a6f5.2080...@uptime.jp Here are more benchmark results, including my first patch.

[HEAD]
number of transactions actually processed: 3439971
tps = 57331.891602 (including connections establishing)
tps = 57340.932324 (excluding connections establishing)

[My first patch]
number of transactions actually processed: 3453745
tps = 57562.196971 (including connections establishing)
tps = 57569.197838 (excluding connections establishing)

Actually, I'm not sure why my patch makes PostgreSQL faster, :D but the performance seems better than with my second patch. I think it still needs some trick to keep the counters in pgstat.stat across restarts, but it would be more acceptable in terms of performance overhead. Regards, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp
Re: [HACKERS] pg_stat_lwlocks view - lwlocks statistics, round 2
(2012/10/14 13:26), Fujii Masao wrote: On Sun, Oct 14, 2012 at 10:46 AM, Satoshi Nagayasu sn...@uptime.jp wrote:

[HEAD]
number of transactions actually processed: 3439971
tps = 57331.891602 (including connections establishing)
tps = 57340.932324 (excluding connections establishing)

(snip)

[pg_stat_lwlocks patch (reporting disabled)]
number of transactions actually processed: 3429370
tps = 57155.286475 (including connections establishing)
tps = 57163.996943 (excluding connections establishing)

So, I think some additional hack to reduce reporting is needed. Would it be acceptable in terms of performance? The tracing lwlock usage seems to still cause a small performance overhead even if reporting is disabled. I believe some users would prefer to avoid such overhead even if pg_stat_lwlocks is not available. It should be up to a user to decide whether to trace lwlock usage, e.g., by using a trace_lwlock parameter, I think. Frankly speaking, I do not agree with disabling a performance instrument to improve performance. A DBA must *always* monitor the performance metrics under such a heavy workload. But it's OK to add a parameter to enable/disable it. Any other comments? Another comment: are local_calls/waits/time_ms really required? I'm not sure how that info would help performance debugging. I think there is a need to observe how your test query is affected by workload from other sessions, so splitting local and shared statistics would be nice to have. Just my thought, though. What I don't like is that a session can see only the local stats of its own session. It's hard to monitor local stats. Imagine the case where you'd like to monitor the local stats of each pgbench session. To monitor such stats, you need to modify pgbench so that each of its sessions monitors its own local stats. Even if you run monitoring software, it cannot collect those stats because they don't belong to the session that it uses. Ok.
I'm waiting for more comments from others. Dropping it is easy for me, but are there any other comments? Josh? Regards, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp
[HACKERS] pg_stat_lwlocks view - lwlocks statistics, round 2
Hi all, I have fixed my previous patch for the pg_stat_lwlocks view, and as Josh commented, it now supports local and global (shared) statistics in the same system view. Local statistics mean the counters are only effective within the same session, and shared ones mean the counters are shared within the entire cluster. The global statistics are collected via the pgstat collector process like other statistics are. Now, the global statistics struct has been split into two parts for different uses, for bgwriter stats and lwlock stats. Therefore, calling pg_stat_reset_shared('bgwriter') or pg_stat_reset_shared('lwlocks') resets only the dedicated struct, not the entire PgStat_GlobalStats. Comments and review are always welcome. Regards, --

postgres=# SELECT * FROM pg_stat_lwlocks;
 lwlockid | local_calls | local_waits | local_time_ms | shared_calls | shared_waits | shared_time_ms
----------+-------------+-------------+---------------+--------------+--------------+----------------
        0 |           0 |           0 |             0 |         4268 |            0 |              0
        1 |          43 |           0 |             0 |          387 |            0 |              0
        2 |           0 |           0 |             0 |           19 |            0 |              0
        3 |           0 |           0 |             0 |           28 |            0 |              0
        4 |           3 |           0 |             0 |          315 |            0 |              0
        5 |           0 |           0 |             0 |           24 |            0 |              0
        6 |           1 |           0 |             0 |           76 |            0 |              0
        7 |           0 |           0 |             0 |        16919 |            0 |              0
        8 |           0 |           0 |             0 |            0 |            0 |              0
        9 |           0 |           0 |             0 |            0 |            0 |              0
       10 |           0 |           0 |             0 |            0 |            0 |              0
       11 |           0 |           0 |             0 |           75 |            0 |              0
       12 |           0 |           0 |             0 |            0 |            0 |              0
       13 |           0 |           0 |             0 |            0 |            0 |              0
       14 |           0 |           0 |             0 |            0 |            0 |              0
       15 |           0 |           0 |             0 |            0 |            0 |              0
       16 |           0 |           0 |             0 |            0 |            0 |              0
       17 |           0 |           0 |             0 |        61451 |            6 |              0
       18 |           0 |           0 |             0 |            0 |            0 |              0
       19 |           0 |           0 |             0 |            0 |            0 |              0
       20 |           0 |           0 |             0 |            0 |            0 |              0
       21 |           1 |           0 |             0 |            9 |            0 |              0
       22 |           0 |           0 |             0 |            0 |            0 |              0
       23 |           0 |           0 |             0 |            0 |            0 |              0
       24 |           0 |           0 |             0 |            1 |            0 |              0
       25 |           0 |           0 |             0 |            0 |            0 |              0
       26 |           2 |           0 |             0 |           18 |            0 |              0
       27 |           0 |           0 |             0 |            0 |            0 |              0
       28 |           0 |           0 |             0 |            0 |            0 |              0
       29 |           0 |           0 |             0 |            0 |            0 |              0
       30 |           0 |           0 |             0 |            0 |            0 |              0
       31 |           0 |           0 |             0 |            0 |            0 |              0
       32 |           0 |           0 |             0 |            0 |            0 |              0
       33 |           4 |           0 |             0 |       207953 |            0 |              0
       50 |           8 |           0 |             0 |        33388 |            0 |              0
       67 |           0 |           0 |             0 |            0 |            0 |              0
(36 rows)

postgres=# -- 2012/06/26 21:11, Satoshi Nagayasu wrote: Hi all, I've modified the pg_stat_lwlocks patch to be able to work
with the latest PostgreSQL Git code. This patch provides: pg_stat_lwlocks New system view
Re: [HACKERS] pg_stat_lwlocks view - lwlocks statistics, round 2
Hi, 2012/10/13 23:05, Satoshi Nagayasu wrote: Hi all, I have fixed my previous patch for the pg_stat_lwlocks view, and as Josh commented, it now supports local and global (shared) statistics in the same system view. Sorry, I found my mistakes. A new, fixed one is attached to this mail. Regards, Local statistics mean the counters are only effective within the same session, and shared ones mean the counters are shared within the entire cluster. The global statistics are collected via the pgstat collector process like other statistics are. Now, the global statistics struct has been split into two parts for different uses, for bgwriter stats and lwlock stats. Therefore, calling pg_stat_reset_shared('bgwriter') or pg_stat_reset_shared('lwlocks') resets only the dedicated struct, not the entire PgStat_GlobalStats. Comments and review are always welcome. Regards, --

postgres=# SELECT * FROM pg_stat_lwlocks;
 lwlockid | local_calls | local_waits | local_time_ms | shared_calls | shared_waits | shared_time_ms
----------+-------------+-------------+---------------+--------------+--------------+----------------
        0 |           0 |           0 |             0 |         4268 |            0 |              0
        1 |          43 |           0 |             0 |          387 |            0 |              0
        2 |           0 |           0 |             0 |           19 |            0 |              0
        3 |           0 |           0 |             0 |           28 |            0 |              0
        4 |           3 |           0 |             0 |          315 |            0 |              0
        5 |           0 |           0 |             0 |           24 |            0 |              0
        6 |           1 |           0 |             0 |           76 |            0 |              0
        7 |           0 |           0 |             0 |        16919 |            0 |              0
        8 |           0 |           0 |             0 |            0 |            0 |              0
        9 |           0 |           0 |             0 |            0 |            0 |              0
       10 |           0 |           0 |             0 |            0 |            0 |              0
       11 |           0 |           0 |             0 |           75 |            0 |              0
       12 |           0 |           0 |             0 |            0 |            0 |              0
       13 |           0 |           0 |             0 |            0 |            0 |              0
       14 |           0 |           0 |             0 |            0 |            0 |              0
       15 |           0 |           0 |             0 |            0 |            0 |              0
       16 |           0 |           0 |             0 |            0 |            0 |              0
       17 |           0 |           0 |             0 |        61451 |            6 |              0
       18 |           0 |           0 |             0 |            0 |            0 |              0
       19 |           0 |           0 |             0 |            0 |            0 |              0
       20 |           0 |           0 |             0 |            0 |            0 |              0
       21 |           1 |           0 |             0 |            9 |            0 |              0
       22 |           0 |           0 |             0 |            0 |            0 |              0
       23 |           0 |           0 |             0 |            0 |            0 |              0
       24 |           0 |           0 |             0 |            1 |            0 |              0
       25 |           0 |           0 |             0 |            0 |            0 |              0
       26 |           2 |           0 |             0 |           18 |            0 |              0
       27 |           0 |           0 |             0 |            0 |            0 |              0
       28 |           0 |           0 |             0 |            0 |            0 |              0
       29 |           0 |           0 |             0 |            0 |            0 |              0
       30 |           0 |           0 |             0 |            0 |            0 |              0
       31 |           0 |           0 |             0 |            0 |            0 |              0
       32 |           0 |           0 |             0 |            0 |            0 |              0
       33 |           4 |           0 |             0 |       207953 |            0 |              0
       50 |           8 |           0 |             0 |        33388 |            0 |              0
       67 |           0 |           0 |             0 |            0 |            0 |              0
(36 rows)
Re: [HACKERS] pg_stat_lwlocks view - lwlocks statistics, round 2
Thanks for the review. 2012/10/14 8:55, Michael Paquier wrote: On Sun, Oct 14, 2012 at 6:00 AM, Fujii Masao masao.fu...@gmail.com wrote: On Sun, Oct 14, 2012 at 3:34 AM, Fujii Masao masao.fu...@gmail.com wrote: On Sat, Oct 13, 2012 at 11:34 PM, Satoshi Nagayasu sn...@uptime.jp wrote: Hi, 2012/10/13 23:05, Satoshi Nagayasu wrote: Hi all, I have fixed my previous patch for the pg_stat_lwlocks view, and as Josh commented, it now supports local and global (shared) statistics in the same system view. Sorry, I found my mistakes. A new, fixed one is attached to this mail. Thanks for revising the patch. Here are the comments: The documentation needs to be updated. The patch caused the following compile warnings on my machine: pgstat.c:1357: warning: no previous prototype for 'pgstat_report_lwlockstat' postgres.c:3922: warning: implicit declaration of function 'pgstat_report_lwlockstat' pgstatfuncs.c:1854: warning: no previous prototype for 'pg_stat_reset_lwlocks' Oops. I just fixed them. Thanks. In my test, this patch caused measurable performance overhead. I created the test database by pgbench -s10 and ran pgbench -c8 -j8 -T60 -S. Results are:

[HEAD]
number of transactions actually processed: 1401369
tps = 23351.375811 (including connections establishing)
tps = 23355.900043 (excluding connections establishing)

[PATCH]
number of transactions actually processed: 1401369
tps = 23351.375811 (including connections establishing)
tps = 23355.900043 (excluding connections establishing)

Oops! Obviously I copied and pasted the test result wrongly... Here is the right result.
[HEAD]
number of transactions actually processed: 1401369
tps = 23351.375811 (including connections establishing)
tps = 23355.900043 (excluding connections establishing)

[PATCH]
number of transactions actually processed: 1092400
tps = 18179.498013 (including connections establishing)
tps = 18182.450824 (excluding connections establishing)

Is the performance difference due only to the mutex lock being taken? I think it comes from the high-frequency reporting through the pgstat collector process, which means calling pgstat_report_lwlockstat() at PostgresMain(). diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c index 585db1a..5ca2c6f 100644 --- a/src/backend/tcop/postgres.c +++ b/src/backend/tcop/postgres.c @@ -3919,6 +3919,8 @@ PostgresMain(int argc, char *argv[], const char *username) pgstat_report_activity(STATE_IDLE, NULL); } + pgstat_report_lwlockstat(); + ReadyForQuery(whereToSendOutput); send_ready_for_query = false; } When I reduced reporting (or just disabled it), the performance was not affected by this patch. Here are some additional results of the same performance test Fujii-san ran.
[HEAD]
number of transactions actually processed: 3439971
tps = 57331.891602 (including connections establishing)
tps = 57340.932324 (excluding connections establishing)

[pg_stat_lwlocks patch (reporting enabled)]
number of transactions actually processed: 2665550
tps = 44425.038125 (including connections establishing)
tps = 44430.565651 (excluding connections establishing)

[pg_stat_lwlocks patch (reporting disabled)]
number of transactions actually processed: 3429370
tps = 57155.286475 (including connections establishing)
tps = 57163.996943 (excluding connections establishing)

[pg_stat_lwlocks patch (reporting reduced to 1/100)]
number of transactions actually processed: 3421879
tps = 57030.660814 (including connections establishing)
tps = 57038.950498 (excluding connections establishing)

So, I think some additional hack to reduce reporting is needed. Would it be acceptable in terms of performance? Another comment: are local_calls/waits/time_ms really required? I'm not sure how that info would help performance debugging. I think there is a need to observe how your test query is affected by workload from other sessions, so splitting local and shared statistics would be nice to have. Just my thought, though. Regards, -- Fujii Masao -- Michael Paquier
Re: [HACKERS] New statistics for WAL buffer dirty writes
Hi, 2012/08/12 7:11, Jeff Janes wrote: On Sat, Jul 28, 2012 at 3:33 PM, Jeff Janes jeff.ja...@gmail.com wrote: On Sat, Jul 7, 2012 at 9:17 PM, Satoshi Nagayasu sn...@uptime.jp wrote: Hi, Jeff Janes has pointed out that my previous patch could hold the number of dirty writes only in a single local backend, not for the whole cluster, because the counter was allocated in local process memory. That's true, and I have fixed it by moving the counter into shared memory, as a member of XLogCtlWrite, to keep the total dirty writes in the cluster. ... The comment XLogCtrlWrite must be protected with WALWriteLock mis-spells XLogCtlWrite. The final patch will need to add a section to the documentation. Thanks to Robert and Tom for addressing my concerns about the pointer volatility. I think there is enough consensus that this is useful without adding more things to it, like histograms or high water marks. However, I do think we will want to add a way to query for the time of the last reset, as other monitoring features are going that way. Is it OK that the count is reset upon a server restart? pg_stat_bgwriter, for example, does not do that. Unfortunately I think fixing this in an acceptable way will be harder than the entire rest of the patch was. The coding looks OK to me, it applies and builds, and passes make check, and does what it says. I didn't do performance testing, as it is hard to believe it would have a meaningful effect. I'll mark it as waiting on author, for the documentation and reset time. I'd ask a more senior hacker to comment on the durability over restarts. I have rewritten the patch to deal with dirty write statistics through the pgstat collector, as bgwriter does. Yeah, it's a rather bigger rewrite. With this patch, the walwriter process and each backend process sum up dirty writes and send them to the stat collector, so the value can be saved in the stat file and kept across restarts.
The statistics can be retrieved using the pg_stat_get_xlog_dirty_writes() function, and can be reset by calling pg_stat_reset_shared('walwriter'). Now, I have one concern. The reset time could be captured in globalStats.stat_reset_timestamp, but this value is the same as the bgwriter one. So, once pg_stat_reset_shared('walwriter') is called, the stats_reset column in pg_stat_bgwriter would represent the reset time for walwriter, not for bgwriter. How should we handle this? Should we split this value? And should we have a new system view for walwriter? Of course, I will work on documentation next. Regards, Cheers, Jeff -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index ff56c26..234d568 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -1518,6 +1518,7 @@ AdvanceXLInsertBuffer(bool new_segment) WriteRqst.Write = OldPageRqstPtr; WriteRqst.Flush = 0; XLogWrite(WriteRqst, false, false); + WalWriterStats.m_xlog_dirty_writes++; LWLockRelease(WALWriteLock); TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE(); } diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c index 8389d5c..f031be1 100644 --- a/src/backend/postmaster/pgstat.c +++ b/src/backend/postmaster/pgstat.c @@ -125,6 +125,15 @@ char *pgstat_stat_tmpname = NULL; */ PgStat_MsgBgWriter BgWriterStats; +/* + * WalWriter global statistics counter. + * Despite its name, this counter is actually used not only in walwriter, + * but also in each backend process to sum up xlog dirty writes. + * Those processes would increment this counter in each XLogWrite call, + * then send it to the stat collector process.
+ */
+PgStat_MsgWalWriter WalWriterStats;
+
 /* ----------
  * Local data
  * ----------
@@ -279,6 +288,7 @@
 static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
 static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
 static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
 static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
+static void pgstat_recv_walwriter(PgStat_MsgWalWriter *msg, int len);
 static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
 static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
 static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
@@ -1188,6 +1198,8 @@ pgstat_reset_shared_counters(const char *target)
 	if (strcmp(target, "bgwriter") == 0)
 		msg.m_resettarget = RESET_BGWRITER;
+	else if (strcmp(target, "walwriter") == 0)
+		msg.m_resettarget = RESET_WALWRITER;
 	else
 		ereport(ERROR
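For readers who haven't traced the pgstat flow this rewrite rides on, the accumulate-locally-then-send pattern can be sketched in a few lines of standalone C. The struct and function names below are illustrative stand-ins, not the actual server symbols:

```c
#include <string.h>

/* Illustrative stand-ins for the patch's PgStat_MsgWalWriter message and
 * the collector's saved totals; not actual PostgreSQL symbols. */
typedef struct { unsigned long m_xlog_dirty_writes; } WalWriterMsg;
typedef struct { unsigned long xlog_dirty_writes; } CollectorStats;

WalWriterMsg local_msg;        /* per-process counter: cheap, lock-free bump */
CollectorStats collector;      /* what the stats collector accumulates */

/* Each XLogWrite() call forced by a full WAL buffer bumps the local counter. */
void note_dirty_write(void)
{
    local_msg.m_xlog_dirty_writes++;
}

/* Periodically the process sends its counter to the collector and zeroes it;
 * the collector therefore holds the cluster-wide total, which it can also
 * persist in the stats file across restarts. */
void flush_to_collector(void)
{
    collector.xlog_dirty_writes += local_msg.m_xlog_dirty_writes;
    memset(&local_msg, 0, sizeof(local_msg));
}
```

The real message path goes through the collector's UDP socket rather than a direct struct update, but the shape — local accumulation, periodic flush, durable totals on the collector side — is the same.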
Re: [HACKERS] [PoC] load balancing in libpq
2012/09/24 1:07, Christopher Browne wrote: We historically have connection pooling as an external thing; with the high degree to which people keep implementing and reimplementing this, I think *something* more than we have ought to be built in. This, with perhaps better implementation, might be an apropos start. Parallel with LDAP: it takes very much this approach, where configuration typically offers a list of LDAP servers. I am not certain if OpenLDAP does round robin on the list, or if it tries targets in order, stopping when it succeeds. A decent debate fits in, there. I could see this being implemented instead via something alongside PGSERVICE; that already offers a well-defined way to capture a registry of connection configuration. Specifying a list of service names would allow the command line configuration to remain short and yet very flexible. Thanks for the comment. As you pointed out, I think it would be a starting point to implement new, simple load-balancing stuff. That's what I actually intended. My clients often ask me for an easier way to take advantage of replication and load-balancing. I know there are several considerations to be discussed, such as the API compatibility issue, but it would be worth having in the core (or around the core). And I also know many people are struggling with load-balancing and master-failover things for PostgreSQL replication. If those people are implementing their own load-balancing stuff in their apps again and again, it's time to consider implementing it in the core to deliver, and leverage, the potential of PostgreSQL replication. Regards, On 2012-09-23 10:01 AM, Euler Taveira eu...@timbira.com wrote: On 23-09-2012 07:50, Satoshi Nagayasu wrote: I have just written the first PoC code to enable load balancing in the libpq library. Your POC is totally broken.
Just to point out two problems: (i) semicolon (;) is a valid character for any option in the connection string and (ii) you didn't think about PQsetdb[Login](), PQconnectdbParams() and PQconnectStartParams(). If you want to pursue this idea, you should think of a way to support the same option multiple times (one idea is host1, host2, etc). Isn't it easier to add support in your application or pooling software? -- Euler Taveira de Oliveira - Timbira http://www.timbira.com.br/ PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
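Euler's first point is worth spelling out: a semicolon may legally appear inside a quoted option value (libpq allows single-quoted values with backslash escapes), so a naive strchr()-based split corrupts such strings. At minimum the splitter would need a quote-aware scan; a standalone sketch (illustrative code, not libpq's actual parser):

```c
#include <string.h>

/* Return the offset of the first ';' that lies outside single quotes,
 * or -1 if there is none.  A naive strchr(s, ';') would wrongly split
 *   host=db1 password='ab;cd'
 * because ';' is valid inside a quoted conninfo value. */
int find_separator(const char *s)
{
    int in_quote = 0;
    for (int i = 0; s[i] != '\0'; i++)
    {
        if (s[i] == '\\' && s[i + 1] != '\0')
        {
            i++;                        /* skip a backslash-escaped char */
            continue;
        }
        if (s[i] == '\'')
            in_quote = !in_quote;       /* toggle quoting state */
        else if (s[i] == ';' && !in_quote)
            return i;
    }
    return -1;
}
```

Even with this, Euler's second point stands: the split happens before option parsing, so the PQconnectdbParams()-family entry points never see the list at all.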
Re: [HACKERS] pg_reorg in core?
2012/09/25 0:15, Simon Riggs wrote: On 21 September 2012 08:42, Michael Paquier michael.paqu...@gmail.com wrote: On Fri, Sep 21, 2012 at 1:00 PM, Hitoshi Harada umi.tan...@gmail.com wrote: I'm not familiar with pg_reorg, but I wonder why we need a separate program for this task. I know pg_reorg is ok as an external program per se, but if we could optimize CLUSTER (or VACUUM, which I'm a little pessimistic about) in the same way, it's much nicer than having an additional binary + extension. Isn't it possible to do the same thing above within the CLUSTER command? Maybe CLUSTER .. CONCURRENTLY? CLUSTER might be more adapted in this case as the purpose is to reorder the table. The same technique used by pg_reorg (aka a table coupled with triggers) could lower the lock level on the table. Also, it could be possible to control each sub-operation in the same fashion as CREATE INDEX CONCURRENTLY. By the way, whatever the operation used, VACUUM or CLUSTER, I have a couple of doubts: 1) isn't it too costly for a core operation, as pg_reorg really needs many temporary objects? It could be possible to reduce the number of objects created if added to core, though... 2) Do you think the current CLUSTER is enough, and are there wishes to implement such an optimization directly in core? For me, the Postgres user interface should include * REINDEX CONCURRENTLY * CLUSTER CONCURRENTLY * ALTER TABLE CONCURRENTLY and also that autovacuum would be expanded to include REINDEX and CLUSTER, renaming it to automaint. The actual implementation mechanism for those probably looks something like pg_reorg, but I don't see it as preferable to include the utility directly into core, though potentially some of the underlying code might be. I think it depends on what trade-off we can see. AFAIK, basically, rebuilding tables and/or indexes has a trade-off between lock-freedom and disk space.
So, if we have enough disk space to build a temporary table/index while rebuilding a table/index, a CONCURRENTLY option would be great, and I would love to have it in core. Regards, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] [PoC] load balancing in libpq
Hi all, I have just written the first PoC code to enable load balancing in the libpq library. This libpq enhancement is intended to allow PostgreSQL users to take advantage of replication in an easier way. With this patch, the PQconnectdb() function accepts multiple connection info strings, and picks one of them on a round-robin basis to connect. The example below shows that PQconnectdb() accepts a connection string that contains four different databases (db1~db4) on different servers (dbhost1~dbhost4), and then connects to them on a round-robin basis. I know there are several things to be considered, but first, I need your feedback or comments on this enhancement. Do you think it would be useful? Regards,

[snaga@devvm03 libpq_repli]$ cat libpq_lb_test.c
#include <stdio.h>
#include <libpq-fe.h>

void _connect()
{
  PGconn *conn;
  PGresult *res;

  conn = PQconnectdb("host=dbhost1 dbname=db1 user=snaga; host=dbhost2 dbname=db2 user=snaga; host=dbhost3 dbname=db3 user=snaga; host=dbhost4 dbname=db4 user=snaga");

  res = PQexec(conn, "SELECT current_database()");
  if ( PQresultStatus(res)==PGRES_TUPLES_OK )
  {
    printf("current_database = %s on %s\n", PQgetvalue(res, 0, 0), PQhost(conn));
  }

  PQfinish(conn);
}

int main(void)
{
  int i;

  for (i=0 ; i<8 ; i++)
    _connect();

  return 0;
}

[snaga@devvm03 libpq_repli]$ ./libpq_lb_test
current_database = db1 on dbhost1
current_database = db2 on dbhost2
current_database = db3 on dbhost3
current_database = db4 on dbhost4
current_database = db1 on dbhost1
current_database = db2 on dbhost2
current_database = db3 on dbhost3
current_database = db4 on dbhost4
[snaga@devvm03 libpq_repli]$

-- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC.
http://www.uptime.jp

diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c
index 9eaf410..14e31b6 100644
--- a/src/interfaces/libpq/fe-connect.c
+++ b/src/interfaces/libpq/fe-connect.c
@@ -569,6 +569,81 @@ PQconnectStartParams(const char *const * keywords,
 	return conn;
 }
 
+struct ConninfoRepli {
+	int n_conninfo;
+	char *conninfo_array[128];
+};
+
+static bool
+parse_conninfo_repli(char *conninfo, size_t len, const char *conninfo_repli)
+{
+	char buf[1024];
+	char *conninfo_ptr;
+	char *end_ptr;
+	struct ConninfoRepli conninfo_r;
+	static int conninfo_idx = 0;
+	int i;
+
+	memset(&conninfo_r, 0, sizeof(struct ConninfoRepli));
+
+	conninfo_r.n_conninfo = 0;
+
+#ifdef REPLI_DEBUG
+	printf("DEBUG: %s\n", conninfo_repli);
+#endif
+
+	conninfo_ptr = (char *)conninfo_repli;
+
+	while (1)
+	{
+		int len;
+
+		end_ptr = strchr(conninfo_ptr, ';');
+		if ( end_ptr )
+			len = end_ptr - conninfo_ptr;
+		else
+			len = strlen(conninfo_ptr);
+
+		memset(buf, 0, sizeof(buf));
+		strncpy(buf, conninfo_ptr, len);
+
+		conninfo_r.conninfo_array[conninfo_r.n_conninfo] = strdup(buf);
+		conninfo_r.n_conninfo++;
+
+		if ( !end_ptr )
+			break;
+
+		conninfo_ptr = end_ptr + 1;
+	}
+
+#ifdef REPLI_DEBUG
+	printf("DEBUG: n_conninfo = %d\n", conninfo_r.n_conninfo);
+
+	for (i=0 ; i<conninfo_r.n_conninfo ; i++)
+	{
+		printf("DEBUG: %s\n", conninfo_r.conninfo_array[i]);
+	}
+#endif
+
+	strncpy(conninfo, conninfo_r.conninfo_array[conninfo_idx], len);
+	conninfo_idx++;
+
+#ifdef REPLI_DEBUG
+	printf("DEBUG: conninfo = %s\n", conninfo);
+#endif
+
+	if ( conninfo_idx >= conninfo_r.n_conninfo )
+		conninfo_idx = 0;
+
+	for (i=0 ; i<conninfo_r.n_conninfo ; i++)
+	{
+		free(conninfo_r.conninfo_array[i]);
+	}
+
+	return true;
+}
+
 /*
  * PQconnectStart
  *
@@ -592,6 +667,7 @@ PGconn *
 PQconnectStart(const char *conninfo)
 {
 	PGconn	   *conn;
+	char conninfo2[1024];
 
 	/*
 	 * Allocate memory for the conn structure
@@ -601,9 +677,15 @@ PQconnectStart(const char *conninfo)
 		return NULL;
 
 	/*
+	 * Parse a user-specified conninfo string that contains
+	 * multiple conninfo strings, and pick one on a round-robin basis.
+	 */
+	parse_conninfo_repli(conninfo2, sizeof(conninfo2), conninfo);
+
+	/*
 	 * Parse the conninfo string
 	 */
-	if (!connectOptions1(conn, conninfo))
+	if (!connectOptions1(conn, conninfo2))
 		return conn;
 
 	/*
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
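Stripped of the string handling, the heart of the PoC is just a rotating index over the split-out conninfo strings. A standalone sketch of that selection logic (illustrative names; in the patch the index is a function-local static, so the rotation is per process):

```c
#include <string.h>

/* Round-robin pick: given n conninfo strings already split out, return the
 * next one and advance the rotating index, wrapping after the last entry.
 * Illustrative sketch of the PoC's selection step, not the patch itself. */
const char *pick_round_robin(const char **conninfos, int n, int *idx)
{
    const char *chosen = conninfos[*idx];

    *idx = (*idx + 1) % n;      /* wrap back to the first entry */
    return chosen;
}
```

Keeping the index per process (as the static in parse_conninfo_repli does) means each client process rotates independently; across many short-lived processes the distribution still evens out, which matches the behavior shown in the test output above.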
Re: [HACKERS] pg_reorg in core?
2012/09/23 12:37, Greg Sabino Mullane wrote: -BEGIN PGP SIGNED MESSAGE- Hash: RIPEMD160 I think it's time to consider some *umbrella project* for maintaining several small projects outside the core. Well, that was pgfoundry, and it didn't work out. I'm not sure that is quite analogous to what was being proposed. I read it as more of let's package a bunch of these small utilities together into a single project, such that installing one installs them all (e.g. aptitude install pg_tools), and they all have a single bug tracker, etc. That tracker could be github, of course. Exactly --- I do not care which SCM system, though. :) I'm not convinced of the merit of that plan, but that's an alternative interpretation that doesn't involve our beloved pgfoundry. :) For example, xlogdump had not been maintained for 5 years when I picked it up last year. And the latest pg_filedump that supports 9.2 has not been released yet. pg_reorg as well. If those tools were in a single project, it would be easier to keep attention on them. Then, developers could easily build *all of them* at once, fix them, and post any patch on a single mailing list. Actually, it would save developers from wasting their time. From my viewpoint, it's not just an SCM or distribution issue. It's about how such small projects around the core can survive, even if they cannot come into the core. Regards, Oh, and -1 for putting it in core. Way too early, and not important enough. - -- Greg Sabino Mullane g...@turnstep.com PGP Key: 0x14964AC8 201209222334 http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8 -BEGIN PGP SIGNATURE- iEYEAREDAAYFAlBeg/AACgkQvJuQZxSWSsjL5ACgimT71B4lSb1ELhgMw5EBzAKs xHIAn08vxGzmM6eSmDfZfxlJDTousq7h =KgXW -END PGP SIGNATURE- -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_reorg in core?
(2012/09/22 11:01), sakamoto wrote: (2012/09/22 10:02), Christopher Browne wrote: If the present project is having a tough time doing enhancements, I should think it mighty questionable to try to draw it into core; that presses it towards a group of already very busy developers. On the other hand, if the present development efforts can be made more public, by having them take place in a more public repository, that at least has the potential to let others in the community see and participate. There are no guarantees, but privacy is liable to hurt. I wouldn't expect any sudden huge influx of developers, but a steady visible stream of development effort would be mighty useful to a merge into core argument. A *lot* of projects are a lot like this. On the Slony project, we have tried hard to maintain this sort of visibility. Steve Singer, Jan Wieck and I do our individual efforts on git repos visible at GitHub to ensure ongoing efforts aren't invisible inside a corporate repo. It hasn't led to any mass of extra developers, but I am always grateful to see Peter Eisentraut's bug reports. Agreed. What the reorg project needs first is transparency, including issue tracking, bugs, a list of todo items, clarified release schedules, quality assurance, and so forth. Only after all that is done can the discussion about putting it into core be started. Until now, reorg has been developed and maintained behind a corporate repository. But now that its activity has slowed down, what I should do as a maintainer is make the development process more public and find someone to cooperate with :) I think it's time to consider some *umbrella project* for maintaining several small projects outside the core. As you pointed out, the problem here is that it's difficult to keep enough eyeballs and development resources on tiny projects outside the core. For example, NTT OSSC has created lots of tools, but they're facing some difficulties in keeping them maintained because of their development resources.
There are different code repositories, different web sites, different issue tracking systems, and different dev mailing lists for different small projects. My xlogdump as well. Actually, that's the reason why it's difficult to keep enough eyeballs on small third-party projects. And also the reason why some developers want to push their tools into the core, isn't it? :) To solve this problem, I would like to have some umbrella project. It would be called pg dba utils, or something like this. This umbrella project may contain several third-party tools (pg_reorg, pg_rman, pg_filedump, xlogdump, etc, etc...) as its sub-modules. And it may also have a single web site, code repository, issue tracking system and developer mailing list, in order to share its development resources for testing, maintaining and releasing. I think it would help third-party projects keep enough eyeballs even outside the core. Of course, if a third-party project has a faster development pace and enough eyeballs to maintain it, it's ok for it to be an independent project. However, when a tool has already matured with fewer eyeballs, it needs to be merged into this umbrella project. Any comments? Sakamoto -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Adding probes for smgr
(2012/07/29 12:14), Tom Lane wrote: Peter Geoghegan pe...@2ndquadrant.com writes: On 28 July 2012 17:15, Tom Lane t...@sss.pgh.pa.us wrote: IMV smgr is pretty vestigial. I wouldn't recommend loading more functionality onto that layer, because it's as likely as not that we'll just get rid of it someday. Agreed. I recently found myself reading a paper written by Stonebraker back in the Berkeley days: http://dislab2.hufs.ac.kr/dislab/seminar/2007/ERL-M87-06.pdf This paper appears to have been published in about 1988, and it shows. It's fairly obvious from reading the opening paragraph that the original rationale for the design of the storage manager doesn't hold these days. Of course, it's also obvious from reading the code, since for example there is only one storage manager module. Yeah. There were actually two storage managers in what we inherited from Berkeley, but we soon got rid of the other one as being useless. (IIRC it was meant for magnetic-core memory ... anybody seen any of that lately?) I remember that I had found mm.c when I started looking around the storage manager code, maybe a decade ago. It seemed very interesting, but I've never seen magnetic-core memory itself. :) Anyway, I'm realizing that it's reasonable to add new probes into the md module at this time. Thanks for the comments. I think basically what happened since then is that the functionality Stonebraker et al imagined as being in per-storage-manager code all migrated into the kernel device drivers, or even down into the hardware itself. (SSDs are *way* smarter than the average '80s storage device, and even those were an order of magnitude smarter than what they'd been ten years previously. I used to do device drivers back in the 80's...) There's no longer any good reason to have anything but md.c, which isn't so much a magnetic disk interface as an interface to something that has a Unix block device driver. 
BTW, I'm still interested in keeping the storage manager architecture theoretically pluggable. From my point of view, the current pluggable architecture was designed to serve different storage managers by bypassing the POSIX file system API. I agree that most recent storage devices can be managed under ordinary file systems, ext3/ext4 or xfs for example. However, I'm still curious to see a new project which intends to extend the storage manager to gain more performance on some special storage devices by bypassing the ordinary file system API. For example, Fusion-IO is offering a dedicated API for their PCI-Express flash storage, and I guess database performance could tremendously benefit from it. (I accept that it's just my curiosity, though. :) It's just a possible option, but keeping the storage architecture pluggable does make sense to me, even if the community ships only one storage manager. Regards, This state of affairs sort of reminds me of mcxt.c . The struct MemoryContextData is described as an abstract type that can have multiple implementations, despite the fact that since 2000 (and perhaps earlier), the underlying type is invariably AllocSetContext. I never investigated if that indirection still needs to exist, but I suspect that it too is a candidate for refactoring. Do you agree? Meh. Having invented the MemoryContext interface, I am probably not the best-qualified person to evaluate it objectively. The original thought was that we might have (a) a context type that could allocate storage in shared memory, and/or (b) a context type that could provide better allocation speed at a loss of storage efficiency (eg, lose the ability to pfree individual chunks). Case (a) has never become practical given the inability of SysV-style shared memory to expand at all. I don't know if that might change when/if we switch to some other shmem API.
The idea of a different allocation strategy for some usages still seems like something we'll want to do someday, though. regards, tom lane -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
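The pluggable shape being discussed here — for both smgr and MemoryContext — is essentially a dispatch table of function pointers, one row per implementation. A minimal standalone sketch of that pattern (illustrative names, not the real f_smgr struct in smgr.c):

```c
/* One row of callbacks per storage manager; a Fusion-IO style manager
 * could register a second row with its own read/write implementations.
 * Illustrative sketch, not PostgreSQL's actual smgr dispatch table. */
typedef struct
{
    int (*smgr_read)(int blockno, char *buffer);
    int (*smgr_write)(int blockno, const char *buffer);
} StorageManagerOps;

/* A trivial implementation standing in for md.c. */
static int md_read(int blockno, char *buffer)
{
    (void) blockno;
    buffer[0] = 'r';            /* pretend we filled the buffer */
    return 0;
}

static int md_write(int blockno, const char *buffer)
{
    (void) blockno;
    (void) buffer;
    return 0;
}

static const StorageManagerOps smgr_table[] = {
    { md_read, md_write },      /* slot 0: the "md" manager */
};

/* Callers go through the table, never directly to md_read(); swapping
 * managers is then just selecting a different row. */
int smgr_dispatch_read(int which, int blockno, char *buffer)
{
    return smgr_table[which].smgr_read(blockno, buffer);
}
```

The cost of keeping this indirection is one pointer dereference per call, which is why the question in the thread is about whether the flexibility is still worth carrying, not about performance.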
[HACKERS] Adding probes for smgr
Hi, I'm thinking of adding new probes to trace smgr activities. While looking into the implementation, I found that md.c has its own probes within it, but I'm wondering why we do not have those probes in the generic smgr routines themselves. Which would be a better choice? Any ideas or comments? Regards, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] XLogReader v2
2012/07/24 1:15, Robert Haas wrote: On Mon, Jul 23, 2012 at 12:13 PM, Andres Freund and...@2ndquadrant.com wrote: Could that be fixed by moving the debugging routines into a separate set of files, instead of having them lumped in with the code that applies those xlog records? It's a major effort. Those functions use elog(), stringinfo and lots of other stuff... I am hesitant to start working on that. On the other hand - I think an in-core xlogdump would be a great and sensible thing; but I can live with using my hacked up version that simply links to the backend... The stringinfo thing has long been an annoyance to me. libpq has PQExpBuffer which is the exact same thing. I don't like that we have two implementations of that in two different code bases, and you have to remember to spell it right depending on where you are. I'm not sure exactly what the best way to fix that is, but it sure is a pain in the neck. Does it make sense to build some static library which can be referenced from both the backend and several client utilities, including libpq? Or would a dynamic link be preferred? Although I do not have a clear idea right now, is it time to start thinking about it? Regards, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
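For concreteness, the facility duplicated between StringInfo and PQExpBuffer is just a growable, null-terminated string buffer; a shared library would only need one implementation of something like this (minimal sketch with illustrative names, error handling omitted):

```c
#include <stdlib.h>
#include <string.h>

/* A minimal growable string buffer, standing in for what StringInfo and
 * PQExpBuffer each implement separately.  Illustrative names; malloc
 * failure handling omitted for brevity. */
typedef struct
{
    char   *data;   /* always null-terminated */
    size_t  len;    /* bytes used, excluding the terminator */
    size_t  cap;    /* bytes allocated */
} StrBuf;

void strbuf_init(StrBuf *b)
{
    b->cap = 16;
    b->len = 0;
    b->data = malloc(b->cap);
    b->data[0] = '\0';
}

void strbuf_append(StrBuf *b, const char *s)
{
    size_t n = strlen(s);

    /* Grow geometrically, as both existing implementations do, so that
     * repeated appends stay amortized O(1) per byte. */
    while (b->len + n + 1 > b->cap)
    {
        b->cap *= 2;
        b->data = realloc(b->data, b->cap);
    }
    memcpy(b->data + b->len, s, n + 1);   /* copy including terminator */
    b->len += n;
}
```

The hard part the thread is circling is not this code but the packaging: backend code cannot casually depend on libpq, and frontend code cannot call elog(), so the shared piece would have to live below both.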
Re: [HACKERS] [PATCH] XLogReader v2
2012/07/19 19:29, Andres Freund wrote: Hi, Attached is v2 of the patch. Changes are: * more comments * significantly cleaned/simplified code * crc validation * addition of XLogReaderReadOne Definitely needed are: * better validation of records * customizable error handling The first is just work that needs to be done, nothing complicated. The second is a bit more complicated: - We could have a bool had_error and a static char that contains the error message, which the caller can handle as wanted - We could have a callback for error handling I think I prefer the callback solution. The second attached patch is a very, very preliminary xlog dumping utility which currently is more of a debugging facility (as evidenced by the fact that it needs an existing /tmp/xlog directory for writing out data) for the XLogReader. It reuses the builtin xlog dumping logic and thus has to link with backend code. I couldn't find a really sensible way to do this: xlogdump: $(OBJS) $(shell find ../../backend ../../timezone -name objfiles.txt|xargs cat|tr -s \012|grep -v /main.o|sed 's/^/..\/..\/.. $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X) Perhaps somebody has a better idea? I think having an xlogdump utility in core/contrib would be a good idea now that it can be done without a huge amount of code duplication. I plan to check Satoshi-san's version of xlogdump to see whether I can crib some of the commandline interface and some code from there. I agree that we need a more sophisticated way to share the code between the backend and several utilities (including xlogdump), but AFAIK, a contrib module must be able to be built *without* the core source tree. Any contrib module must be able to be built with only the header files and the shared libraries when using PGXS. So, it cannot assume that it has the core source tree. (If we need to assume that, I think xlogdump needs to be put into the core/bin directory.)
On the other hand, I have an open issue to improve the maintainability of the duplicated code in the xlogdump project: "Gather all the code which has been copied from the core." https://github.com/snaga/xlogdump/issues/26 So, I agree that we need another way to share the code between the backend and the related utilities. Any good ideas? I have one more concern about putting xlogdump into the core. xlogdump is intended to deliver any new features and enhancements to all the users, not only those on the latest major version, but also those on older major versions maintained by the community, because xlogdump must be a quite important tool when a DBA needs it. In fact, the latest xlogdump now supports 5 major versions, from 8.3 to 9.2. https://github.com/snaga/xlogdump/blob/master/README.xlogdump But AFAIK, putting xlogdump into core/contrib would mean that the source tree of each major version could not receive large modifications after each release (or each code freeze, actually). It would mean that users on older major versions could not take advantage of the new features and enhancements of the latest xlogdump, which is not what I want, actually. Regards, Greetings, Andres -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] New statistics for WAL buffer dirty writes
Hi all, I've created a new patch to get/reset statistics of WAL buffer writes (flushes) caused by WAL buffer full. This patch provides two new functions, pg_stat_get_xlog_dirty_write() and pg_stat_reset_xlog_dirty_write(), which have been designed to determine an appropriate value for the WAL buffer size. If this counter is increasing in the production environment, it would mean that the WAL buffer size is too small to hold the xlog records generated by the transactions. So, you can increase your WAL buffer size to keep xlog records and to reduce WAL writes. I think this patch would not affect WAL write performance, but I'm still paying attention to it. Any comments or suggestions? Regards,

---
[snaga@devvm03 src]$ psql -p 15432 postgres
psql (9.3devel)
Type "help" for help.

postgres=# SELECT pg_stat_get_xlog_dirty_write();
 pg_stat_get_xlog_dirty_write
------------------------------
                            0
(1 row)

postgres=# \q
[snaga@devvm03 src]$ pgbench -p 15432 -s 10 -c 32 -t 1000 postgres
Scale option ignored, using pgbench_branches table count = 10
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 10
query mode: simple
number of clients: 32
number of threads: 1
number of transactions per client: 1000
number of transactions actually processed: 32000/32000
tps = 141.937738 (including connections establishing)
tps = 142.123457 (excluding connections establishing)
[snaga@devvm03 src]$ psql -p 15432 postgres
psql (9.3devel)
Type "help" for help.

postgres=# SELECT pg_stat_get_xlog_dirty_write();
 pg_stat_get_xlog_dirty_write
------------------------------
                            0
(1 row)

postgres=# begin;
BEGIN
postgres=# DELETE FROM pgbench_accounts;
DELETE 100
postgres=# commit;
COMMIT
postgres=# SELECT pg_stat_get_xlog_dirty_write();
 pg_stat_get_xlog_dirty_write
------------------------------
                        19229
(1 row)

postgres=# SELECT pg_stat_reset_xlog_dirty_write();
 pg_stat_reset_xlog_dirty_write
--------------------------------

(1 row)

postgres=# SELECT pg_stat_get_xlog_dirty_write();
 pg_stat_get_xlog_dirty_write
------------------------------
                            0
(1 row)

postgres=# \q
[snaga@devvm03 src]$
---

-- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC.
http://www.uptime.jp

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 642c129..df1e6d4 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -280,6 +280,11 @@ static XLogRecPtr RedoRecPtr;
  */
 static XLogRecPtr RedoStartLSN = InvalidXLogRecPtr;
 
+/*
+ * Counter for WAL dirty buffer writes.
+ */
+static uint64 WalBufferWriteDirtyCount = 0;
+
 /*--
  * Shared-memory data structures for XLOG control
  *
@@ -1513,6 +1518,7 @@ AdvanceXLInsertBuffer(bool new_segment)
 			WriteRqst.Flush = 0;
 			XLogWrite(WriteRqst, false, false);
 			LWLockRelease(WALWriteLock);
+			WalBufferWriteDirtyCount++;
 			TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
 		}
 	}
@@ -10492,3 +10498,15 @@ SetWalWriterSleeping(bool sleeping)
 	xlogctl->WalWriterSleeping = sleeping;
 	SpinLockRelease(&xlogctl->info_lck);
 }
+
+uint64
+xlog_dirty_write_counter_get()
+{
+	return WalBufferWriteDirtyCount;
+}
+
+void
+xlog_dirty_write_counter_reset()
+{
+	WalBufferWriteDirtyCount = 0;
+}
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 7c0705a..d544a5b 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -117,6 +117,9 @@ extern Datum pg_stat_reset_shared(PG_FUNCTION_ARGS);
 extern Datum pg_stat_reset_single_table_counters(PG_FUNCTION_ARGS);
 extern Datum pg_stat_reset_single_function_counters(PG_FUNCTION_ARGS);
 
+extern Datum pg_stat_get_xlog_dirty_write(PG_FUNCTION_ARGS);
+extern Datum pg_stat_reset_xlog_dirty_write(PG_FUNCTION_ARGS);
+
 /* Global bgwriter statistics, from bgwriter.c */
 extern PgStat_MsgBgWriter bgwriterStats;
@@ -1700,3 +1703,16 @@ pg_stat_reset_single_function_counters(PG_FUNCTION_ARGS)
 	PG_RETURN_VOID();
 }
+
+Datum
+pg_stat_get_xlog_dirty_write(PG_FUNCTION_ARGS)
+{
+	PG_RETURN_INT64(xlog_dirty_write_counter_get());
+}
+
+Datum
+pg_stat_reset_xlog_dirty_write(PG_FUNCTION_ARGS)
+{
+	xlog_dirty_write_counter_reset();
+	PG_RETURN_VOID();
+}
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index ec79870..01343b9 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -325,6 +325,9 @@
 extern XLogRecPtr do_pg_start_backup(const char *backupidstr, bool fast, char **
 extern XLogRecPtr do_pg_stop_backup(char *labelfile, bool waitforarchive
Re: [HACKERS] New statistics for WAL buffer dirty writes
2012/07/07 22:07, Euler Taveira wrote: On 07-07-2012 09:00, Satoshi Nagayasu wrote: I've created a new patch to get/reset statistics of WAL buffer writes (flushes) caused by WAL buffer full. This new statistic doesn't solve your problem (tune wal_buffers). It doesn't give you the wal_buffers value. It only says hey, I needed more buffers so I wrote those dirty ones. It doesn't say how many. I would like to have something that says hey, you have 1000 buffers available and you are using 100 buffers (10%). This new statistic is only useful for decreasing the WALWriteLock contention. I agree that it would not tell the exact number for wal_buffers, but it would help DBAs understand what's actually happening around the WAL buffers. Also, decreasing the WALWriteLock contention is obviously important for DBAs in terms of improving database performance. Actually, that's the reason why I'm working on another statistic. :) http://archives.postgresql.org/pgsql-hackers/2012-06/msg01489.php Regards, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] New statistics for WAL buffer dirty writes
Hi, Jeff Janes has pointed out that my previous patch could hold a number of the dirty writes only in single local backend, and it could not hold all over the cluster, because the counter was allocated in the local process memory. That's true, and I have fixed it with moving the counter into the shared memory, as a member of XLogCtlWrite, to keep total dirty writes in the cluster. Regards, 2012/07/07 21:00, Satoshi Nagayasu wrote: Hi all, I've created new patch to get/reset statistics of WAL buffer writes (flushes) caused by WAL buffer full. This patch provides two new functions, pg_stat_get_xlog_dirty_write() and pg_stat_reset_xlog_dirty_write(), which have been designed to determine an appropriate value for WAL buffer size. If this counter is increasing in the production environment, it would mean that the WAL buffer size is too small to hold xlog records generated the transactions. So, you can increase your WAL buffer size to keep xlog records and to reduce WAL writes. I think this patch would not affect to WAL write performance, but still paying attention to it. Any comments or suggestions? Regards, --- [snaga@devvm03 src]$ psql -p 15432 postgres psql (9.3devel) Type help for help. postgres=# SELECT pg_stat_get_xlog_dirty_write(); pg_stat_get_xlog_dirty_write -- 0 (1 row) postgres=# \q [snaga@devvm03 src]$ pgbench -p 15432 -s 10 -c 32 -t 1000 postgres Scale option ignored, using pgbench_branches table count = 10 starting vacuum...end. transaction type: TPC-B (sort of) scaling factor: 10 query mode: simple number of clients: 32 number of threads: 1 number of transactions per client: 1000 number of transactions actually processed: 32000/32000 tps = 141.937738 (including connections establishing) tps = 142.123457 (excluding connections establishing) [snaga@devvm03 src]$ psql -p 15432 postgres psql (9.3devel) Type help for help. 
postgres=# SELECT pg_stat_get_xlog_dirty_write();
 pg_stat_get_xlog_dirty_write
------------------------------
                            0
(1 row)

postgres=# begin;
BEGIN
postgres=# DELETE FROM pgbench_accounts;
DELETE 100
postgres=# commit;
COMMIT
postgres=# SELECT pg_stat_get_xlog_dirty_write();
 pg_stat_get_xlog_dirty_write
------------------------------
                        19229
(1 row)

postgres=# SELECT pg_stat_reset_xlog_dirty_write();
 pg_stat_reset_xlog_dirty_write
--------------------------------

(1 row)

postgres=# SELECT pg_stat_get_xlog_dirty_write();
 pg_stat_get_xlog_dirty_write
------------------------------
                            0
(1 row)

postgres=# \q
[snaga@devvm03 src]$
---

-- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 642c129..893acf8 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -370,6 +370,11 @@ typedef struct XLogCtlWrite
 {
 	int			curridx;		/* cache index of next block to write */
 	pg_time_t	lastSegSwitchTime;	/* time of last xlog segment switch */
+
+	/*
+	 * Counter for WAL dirty buffer writes.
+	 */
+	uint64		WalBufferWriteDirtyCount;
 } XLogCtlWrite;

 /*
@@ -1504,6 +1509,8 @@ AdvanceXLInsertBuffer(bool new_segment)
 	}
 	else
 	{
+		XLogCtlWrite *Write = &XLogCtl->Write;
+
 		/*
 		 * Have to write buffers while holding insert lock. This is
 		 * not good, so only write as much as we absolutely must.
@@ -1512,6 +1519,10 @@ AdvanceXLInsertBuffer(bool new_segment)
 		WriteRqst.Write = OldPageRqstPtr;
 		WriteRqst.Flush = 0;
 		XLogWrite(WriteRqst, false, false);
+		/*
+		 * XLogCtlWrite must be protected with WALWriteLock.
+		 */
+		Write->WalBufferWriteDirtyCount++;
 		LWLockRelease(WALWriteLock);
 		TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
 	}
@@ -10492,3 +10503,26 @@ SetWalWriterSleeping(bool sleeping)
 	xlogctl->WalWriterSleeping = sleeping;
 	SpinLockRelease(&xlogctl->info_lck);
 }
+
+uint64
+xlog_dirty_write_counter_get()
+{
+	XLogCtlWrite *Write = &XLogCtl->Write;
+	uint64		count;
+
+	LWLockAcquire(WALWriteLock, LW_SHARED);
+	count = Write->WalBufferWriteDirtyCount;
+	LWLockRelease
Re: [HACKERS] pg_stat_lwlocks view - lwlocks statistics
Hi all, I've modified the pg_stat_lwlocks patch to be able to work with the latest PostgreSQL Git code. This patch provides:

pg_stat_lwlocks: a new system view to show lwlock statistics.
pg_stat_get_lwlocks(): a new function to retrieve lwlock statistics.
pg_stat_reset_lwlocks(): a new function to reset lwlock statistics.

Please try it out. Regards,

2012/06/26 5:29, Satoshi Nagayasu wrote: Hi all, I've been working on a new system view, pg_stat_lwlocks, to observe LWLocks, and have just completed my 'proof-of-concept' code that can work with version 9.1. Now, I'd like to know the possibility of this feature for a future release. With this patch, a DBA can easily identify bottlenecks around lwlocks.

postgres=# SELECT * FROM pg_stat_lwlocks ORDER BY time_ms DESC LIMIT 10;
 lwlockid | calls  | waits | time_ms
----------+--------+-------+---------
       49 | 193326 | 32096 |   23688
        8 |   3305 |   133 |    1335
        2 |     21 |     0 |       0
        4 | 135188 |     0 |       0
        5 |  57935 |     0 |       0
        6 |    141 |     0 |       0
        7 |  24580 |     1 |       0
        3 |   3282 |     0 |       0
        1 |     41 |     0 |       0
        9 |      3 |     0 |       0
(10 rows)

postgres=#

In this view, the 'lwlockid' column represents the LWLockId used in the backends. 'calls' represents how many times LWLockAcquire() was called. 'waits' represents how many times LWLockAcquire() needed to wait within it before lock acquisition. 'time_ms' represents how long in total LWLockAcquire() waited on a lwlock. And lwlocks that use an LWLockId range, such as BufMappingLock or LockMgrLock, are grouped and summed up in a single record. For example, lwlockid 49 in the above view represents LockMgrLock statistics. Now, I know there are some considerations. (1) Performance I've measured LWLock performance both with and without the patch, and confirmed that this patch does not affect LWLock performance at all.
pgbench scores with the patch:
tps = 900.906658 (excluding connections establishing)
tps = 908.528422 (excluding connections establishing)
tps = 903.900977 (excluding connections establishing)
tps = 910.470595 (excluding connections establishing)
tps = 909.685396 (excluding connections establishing)

pgbench scores without the patch:
tps = 909.096785 (excluding connections establishing)
tps = 894.868712 (excluding connections establishing)
tps = 910.074669 (excluding connections establishing)
tps = 904.022770 (excluding connections establishing)
tps = 895.673830 (excluding connections establishing)

Of course, this experiment was not I/O bound, and the cache hit ratio was 99.9%. (2) Memory space In this patch, I added three new members to the LWLock structure as uint64 to collect statistics. It means that those members must be held in shared memory, but I'm not sure whether that's appropriate. I think another possible option is holding those statistics values in local (backend) process memory, and sending them through the stats collector process (like other statistics values). (3) LWLock names (or labels) Currently, the pg_stat_lwlocks view shows the LWLockId itself, but an LWLockId is not easy for a DBA to map to an actual lock type. So, I want to show LWLock names (or labels), like 'WALWriteLock' or 'LockMgrLock', but how should I implement it? Any comments? Regards, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. 
http://www.uptime.jp

diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 7cc1d41..f832b45 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -658,6 +658,14 @@ CREATE VIEW pg_stat_bgwriter AS
         pg_stat_get_buf_alloc() AS buffers_alloc,
         pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;

+CREATE VIEW pg_stat_lwlocks AS
+    SELECT
+        S.lwlockid,
+        S.calls,
+        S.waits,
+        S.time_ms
+    FROM pg_stat_get_lwlocks() AS S;
+
 CREATE VIEW pg_user_mappings AS
     SELECT
         U.oid AS umid,
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 95d4b37..2a2c197 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -32,6 +32,7 @@
 #include "storage/proc.h"
 #include "storage/spin.h"
+#include <sys/time.h>

 /* We use the ShmemLock spinlock to protect LWLockAssign */
 extern slock_t *ShmemLock;
@@ -46,6 +47,11 @@ typedef struct LWLock
 	PGPROC	   *head;		/* head of list of waiting PGPROCs */
 	PGPROC	   *tail;		/* tail of list of waiting PGPROCs
[HACKERS] pg_stat_lwlocks view - lwlocks statistics
Hi all, I've been working on a new system view, pg_stat_lwlocks, to observe LWLocks, and have just completed my 'proof-of-concept' code that can work with version 9.1. Now, I'd like to know the possibility of this feature for a future release. With this patch, a DBA can easily identify bottlenecks around lwlocks.

postgres=# SELECT * FROM pg_stat_lwlocks ORDER BY time_ms DESC LIMIT 10;
 lwlockid | calls  | waits | time_ms
----------+--------+-------+---------
       49 | 193326 | 32096 |   23688
        8 |   3305 |   133 |    1335
        2 |     21 |     0 |       0
        4 | 135188 |     0 |       0
        5 |  57935 |     0 |       0
        6 |    141 |     0 |       0
        7 |  24580 |     1 |       0
        3 |   3282 |     0 |       0
        1 |     41 |     0 |       0
        9 |      3 |     0 |       0
(10 rows)

postgres=#

In this view, the 'lwlockid' column represents the LWLockId used in the backends. 'calls' represents how many times LWLockAcquire() was called. 'waits' represents how many times LWLockAcquire() needed to wait within it before lock acquisition. 'time_ms' represents how long in total LWLockAcquire() waited on a lwlock. And lwlocks that use an LWLockId range, such as BufMappingLock or LockMgrLock, are grouped and summed up in a single record. For example, lwlockid 49 in the above view represents LockMgrLock statistics. Now, I know there are some considerations. (1) Performance I've measured LWLock performance both with and without the patch, and confirmed that this patch does not affect LWLock performance at all.
pgbench scores with the patch:
tps = 900.906658 (excluding connections establishing)
tps = 908.528422 (excluding connections establishing)
tps = 903.900977 (excluding connections establishing)
tps = 910.470595 (excluding connections establishing)
tps = 909.685396 (excluding connections establishing)

pgbench scores without the patch:
tps = 909.096785 (excluding connections establishing)
tps = 894.868712 (excluding connections establishing)
tps = 910.074669 (excluding connections establishing)
tps = 904.022770 (excluding connections establishing)
tps = 895.673830 (excluding connections establishing)

Of course, this experiment was not I/O bound, and the cache hit ratio was 99.9%. (2) Memory space In this patch, I added three new members to the LWLock structure as uint64 to collect statistics. It means that those members must be held in shared memory, but I'm not sure whether that's appropriate. I think another possible option is holding those statistics values in local (backend) process memory, and sending them through the stats collector process (like other statistics values). (3) LWLock names (or labels) Currently, the pg_stat_lwlocks view shows the LWLockId itself, but an LWLockId is not easy for a DBA to map to an actual lock type. So, I want to show LWLock names (or labels), like 'WALWriteLock' or 'LockMgrLock', but how should I implement it? Any comments? Regards, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. 
http://www.uptime.jp

diff -rc postgresql-9.1.2.orig/src/backend/catalog/postgres.bki postgresql-9.1.2/src/backend/catalog/postgres.bki
*** postgresql-9.1.2.orig/src/backend/catalog/postgres.bki 2012-06-20 03:32:46.0 +0900
--- postgresql-9.1.2/src/backend/catalog/postgres.bki 2012-06-26 01:51:52.0 +0900
***
*** 1553,1558
--- 1553,1559
  insert OID = 3071 ( pg_xlog_replay_pause 11 10 12 1 0 0 f f f t f v 0 0 2278 _null_ _null_ _null_ _null_ pg_xlog_replay_pause _null_ _null_ _null_ )
  insert OID = 3072 ( pg_xlog_replay_resume 11 10 12 1 0 0 f f f t f v 0 0 2278 _null_ _null_ _null_ _null_ pg_xlog_replay_resume _null_ _null_ _null_ )
  insert OID = 3073 ( pg_is_xlog_replay_paused 11 10 12 1 0 0 f f f t f v 0 0 16 _null_ _null_ _null_ _null_ pg_is_xlog_replay_paused _null_ _null_ _null_ )
+ insert OID = 3764 ( pg_stat_get_lwlocks 11 10 12 1 100 0 f f f f t s 0 0 2249 {20,20,20,20} {o,o,o,o} {lwlockid,calls,waits,time_ms} _null_ pg_stat_get_lwlocks _null_ _null_ _null_ )
  insert OID = 2621 ( pg_reload_conf 11 10 12 1 0 0 f f f t f v 0 0 16 _null_ _null_ _null_ _null_ pg_reload_conf _null_ _null_ _null_ )
  insert OID = 2622 ( pg_rotate_logfile 11 10 12 1 0 0 f f f t f v 0 0 16 _null_ _null_ _null_ _null_ pg_rotate_logfile _null_ _null_ _null_ )
  insert OID = 2623 ( pg_stat_file 11 10 12 1 0 0 f f f t f v 1 0 2249 25 {25,20,1184,1184,1184,1184,16} {i,o,o,o,o,o,o} {filename,size,access,modification,change,creation,isdir} _null_ pg_stat_file _null_ _null_ _null_ )
diff -rc postgresql-9.1.2.orig/src/backend/catalog/postgres.description postgresql-9.1.2/src/backend/catalog/postgres.description
*** postgresql-9.1.2.orig/src/backend/catalog/postgres.description 2012-06-20 03:32:46.0 +0900
Re: [HACKERS] pg_stat_lwlocks view - lwlocks statistics
2012/06/26 6:44, Josh Berkus wrote: On 6/25/12 1:29 PM, Satoshi Nagayasu wrote: (1) Performance I've measured LWLock performance both with and without the patch, and confirmed that this patch does not affect LWLock performance at all. This would be my main concern with this patch; it's hard for me to imagine that it has no performance impact *at all*, since trace_lwlocks has quite a noticeable one in my experience. However, the answer to that is to submit the patch and let people test. Thanks. I will submit the patch to the CommitFest page with some fixes to be able to work with the latest PostgreSQL on Git. I will remark that it would be far more useful to me if we could also track lwlocks per session. Overall counts are somewhat useful, but more granular counts are even more useful. What period of time does the table cover? Since last reset? Yes. It has not been implemented yet, since this code is just a PoC, but it is another design issue which needs to be discussed. To implement it, a new array can be added in the local process memory to hold lwlock statistics, and counters can be updated in both the shared memory and the local process memory at once. Then, the session can retrieve 'per-session' statistics from the local process memory via some dedicated function. Does it make sense? Any comments? Regards, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp
Re: [HACKERS] pg_stat_lwlocks view - lwlocks statistics
2012/06/26 7:04, Kevin Grittner wrote: Josh Berkus j...@agliodbs.com wrote: On 6/25/12 1:29 PM, Satoshi Nagayasu wrote: (1) Performance I've measured LWLock performance both with and without the patch, and confirmed that this patch does not affect LWLock performance at all. This would be my main concern with this patch; it's hard for me to imagine that it has no performance impact *at all*, since trace_lwlocks has quite a noticeable one in my experience. However, the answer to that is to submit the patch and let people test. I think overhead is going to depend quite a bit on the gettimeofday() implementation, since that is called twice per lock wait. Yes. It's one of my concerns, and what I actually want hackers to test. It looked to me like there was nothing to prevent concurrent updates of the counts while gathering the accumulated values for display. Won't this be a problem on 32-bit builds? Actually, I'd like to know how I can improve my code on a 32-bit box. However, unfortunately I don't have any 32-bit (physical) box now, so I want someone to test it if it needs to be tested. Please add this to the Open CommitFest for a proper review: https://commitfest.postgresql.org/action/commitfest_view/open Will submit soon. Thanks. -Kevin Regards, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp
Re: [HACKERS] log messages for archive recovery progress
2012/01/13 0:13, Robert Haas wrote: On Thu, Jan 12, 2012 at 10:04 AM, Simon Riggs si...@2ndquadrant.com wrote: On Thu, Jan 12, 2012 at 2:13 PM, Robert Haas robertmh...@gmail.com wrote: On Wed, Jan 11, 2012 at 9:01 AM, Simon Riggs si...@2ndquadrant.com wrote: On Wed, Jan 11, 2012 at 1:54 PM, Satoshi Nagayasu sn...@uptime.jp wrote: However, I'm a bit afraid that it will confuse DBAs if we use "restored" under the pg_xlog replay context, because we have already used "restored" to mean a WAL file was successfully copied (not replayed) from the archive directory into the pg_xlog directory under the archive recovery context. So, to report the status of copying WAL files from the archive directory, I think we can use "restored", or "could not restore" on failure. And to report the status of replaying WAL files in the pg_xlog directory (even if a WAL file was copied from the archive), we have to use "recover" or "replay". Agreed. I can change "restored" to "using", so we have two message types: LOG: restored log file 00080047 from archive LOG: using pre-existing log file 00080047 from pg_xlog "using" seems pretty fuzzy to me. "replaying"? That was my first thought, but the message relates to which file has been selected, and how. Once it has been selected it will be replayed. The idea is to have the two messages look similar. The original message was "restored log file ..." and says nothing about replaying. We could change the old message (ugh! backwards compatibility alert): LOG: replaying log file 00080047 after restore from archive LOG: replaying log file 00080047 already in pg_xlog which doesn't sound much stronger to me... not sure. Hmm, I don't know. But that phrasing does at least have the advantage of being clearly parallel, which I like. It seems difficult to keep backward compatibility. :) Anyway, how about this one?
If we have 47 in archive, and 48 in pg_xlog:

(1) LOG: restored log file 00080047 from archive
(2) LOG: replaying log file 00080047
(3) LOG: could not restore file 00080048 from archive
(4) LOG: replaying log file 00080048

In this case, (4) "replaying" after (3) "could not restore ... from archive" would mean that it has started replaying a WAL file from pg_xlog. And if we have 47 in archive, and do not have 48 in pg_xlog:

(5) LOG: restored log file 00080047 from archive
(6) LOG: replaying log file 00080047
(7) LOG: could not restore file 00080048 from archive
(8) LOG: could not replay file 00080048

In this case, (8) "could not replay file" after (7) "could not restore ... from archive" would mean that 48 was found in neither the archive nor pg_xlog, so the latest WAL would be gone. I just got another option in my mind. Would it make sense to report the two numbers of WAL files applied during archive recovery, one from the archive and one from the pg_xlog directory? How about this one?

LOG: XXX file(s) from archive, YYY file(s) from pg_xlog successfully applied.

Thanks, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp
Re: [HACKERS] log messages for archive recovery progress
2012/01/11 19:56, Simon Riggs wrote: 2012/1/11 Euler Taveira de Oliveira eu...@timbira.com: On 08-01-2012 11:59, Satoshi Nagayasu / Uptime Technologies, LLC. wrote:

[2011-12-08 15:14:36 JST] 16758: LOG: restored log file 00080046 from archive
[2011-12-08 15:14:36 JST] 16758: LOG: recoverying 00080046
[2011-12-08 15:14:36 JST] 16758: LOG: restored log file 00080047 from archive
[2011-12-08 15:14:36 JST] 16758: LOG: recoverying 00080047
cp: cannot stat `/backups/archlog/00080048': No such file or directory
[2011-12-08 15:14:37 JST] 16758: LOG: could not restore file 00080048 from archive
[2011-12-08 15:14:37 JST] 16758: LOG: attempting to look into pg_xlog
[2011-12-08 15:14:37 JST] 16758: LOG: recoverying 00080048

What about just 'restored log file 00080048 from pg_xlog' instead of the last two messages? If you can't read from pg_xlog, emit 'could not restore file 00080048 from pg_xlog'. Yes, simple is better. We already have a message if the file is not present, we just need one if we switch to using pg_xlog. Please look at this. Thanks for the patch. I agree that simple is better. However, I'm a bit afraid that it will confuse DBAs if we use "restored" under the pg_xlog replay context, because we have already used "restored" to mean a WAL file was successfully copied (not replayed) from the archive directory into the pg_xlog directory under the archive recovery context. So, to report the status of copying WAL files from the archive directory, I think we can use "restored", or "could not restore" on failure. And to report the status of replaying WAL files in the pg_xlog directory (even if a WAL file was copied from the archive), we have to use "recover" or "replay". I'll try to revise my proposed log messages again. Thanks, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp
[HACKERS] log messages for archive recovery progress
Hi, When I looked into archive recovery deeply from a DBA's point of view, I found that it's difficult to know (1) when the recovery process switched reading WAL segment files from the archive directory to the pg_xlog directory, and (2) whether it succeeded in applying the latest WAL segments in the pg_xlog directory. For example, when I had the 47 WAL segment in the archive directory and the 48 WAL segment in the pg_xlog directory, PostgreSQL said:

[2011-12-08 05:59:03 JST] 9003: LOG: restored log file 00080046 from archive
[2011-12-08 05:59:03 JST] 9003: LOG: restored log file 00080047 from archive
cp: cannot stat `/backups/archlog/00080048': No such file or directory
[2011-12-08 05:59:03 JST] 9003: LOG: record with zero length at 0/489F8B74
[2011-12-08 05:59:03 JST] 9003: LOG: redo done at 0/489F8B38
[2011-12-08 05:59:03 JST] 9003: LOG: last completed transaction was at log time 2011-12-08 05:52:01.507063+09
cp: cannot stat `/backups/archlog/0009.history': No such file or directory
[2011-12-08 05:59:03 JST] 9003: LOG: selected new timeline ID: 9
[2011-12-08 05:59:03 JST] 9003: LOG: restored log file 0008.history from archive
[2011-12-08 05:59:04 JST] 9003: LOG: archive recovery complete

I felt it was difficult to determine whether PostgreSQL had applied the 48 WAL segment in pg_xlog or not. So, I want to propose new log messages to show the progress of archive recovery as it reads WAL segments. With my tiny modification applied, the recovery process reports its progress for each WAL segment file, and also explicitly reports the switch from reading the archive directory to the pg_xlog directory, as shown below.
[2011-12-08 15:14:36 JST] 16758: LOG: restored log file 00080046 from archive
[2011-12-08 15:14:36 JST] 16758: LOG: recoverying 00080046
[2011-12-08 15:14:36 JST] 16758: LOG: restored log file 00080047 from archive
[2011-12-08 15:14:36 JST] 16758: LOG: recoverying 00080047
cp: cannot stat `/backups/archlog/00080048': No such file or directory
[2011-12-08 15:14:37 JST] 16758: LOG: could not restore file 00080048 from archive
[2011-12-08 15:14:37 JST] 16758: LOG: attempting to look into pg_xlog
[2011-12-08 15:14:37 JST] 16758: LOG: recoverying 00080048
[2011-12-08 15:14:37 JST] 16758: LOG: record with zero length at 0/489F8B74
[2011-12-08 15:14:37 JST] 16758: LOG: redo done at 0/489F8B38
[2011-12-08 15:14:37 JST] 16758: LOG: last completed transaction was at log time 2011-12-08 05:52:01.507063+09
cp: cannot stat `/backups/archlog/0009.history': No such file or directory
[2011-12-08 15:14:37 JST] 16758: LOG: selected new timeline ID: 9
[2011-12-08 15:14:37 JST] 16758: LOG: restored log file 0008.history from archive
[2011-12-08 15:14:38 JST] 16758: LOG: archive recovery complete

Do you think this is useful? Thanks, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp
Re: [HACKERS] LWLOCK_STATS
2012/01/07 16:58, Heikki Linnakangas wrote: On 07.01.2012 00:24, Robert Haas wrote: It's been a while since I did any testing with LWLOCK_STATS defined, so I thought it might be about time to do that again and see how things look. Here's how they looked back in July: http://archives.postgresql.org/pgsql-hackers/2011-07/msg01373.php Interesting. A couple of weeks ago I wrote a little patch that's similar to LWLOCK_STATS, but it prints out % of wallclock time that is spent acquiring, releasing, or waiting for a lock. I find that more useful than the counters. Here's the patch, I hope it's useful to others. It uses timer_create() and timer_settime(), so it probably won't work on all platforms, and requires linking with -lrt. I have just written up a systemtap script to observe lock statistics. It shows acquire counts, wait counts and total waiting time for each lwlock every 5 seconds. Screenshot here: http://twitpic.com/83p2cz Is this useful for pg developers? -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp
[HACKERS] xlogdump 0.4.0
Hi hackers, I'm pleased to announce the latest release of xlogdump. xlogdump is a tool for extracting data from WAL segment files. Here is the xlogdump README: https://github.com/snaga/xlogdump/blob/master/xlogdump/README.xlogdump xlogdump was originally developed by Tom Lane and Diogo Biazus around five years ago, but it had not been maintained for the last few years. So, I picked it up, fixed some problems, and enhanced it. It still has several issues, but it has started working, so it's time to try it again, folks! https://github.com/snaga/xlogdump/downloads xlogdump can now be compiled against 8.2 through 9.0 (those were tough fixes!), so it could be a fun (and of course useful) tool for many hackers and users. Please visit the new project repository, and try it. https://github.com/snaga/xlogdump If you have any bug reports and/or comments, please feel free to email me. Thanks in advance, p.s. I really appreciate Tom and Diogo for their previous great work. -- NAGAYASU Satoshi satoshi.nagay...@gmail.com
Re: [HACKERS] top for postgresql (ptop?)
Mark, Very interesting. I've been looking for such a tool. Unfortunately, I can't compile it on my Solaris box right now, but I hope it will be shipped with the PostgreSQL distribution. Mark Wong wrote: Hi everyone, I was playing with converting unixtop (the version of top used in FreeBSD) to only show PostgreSQL processes pulled from the pg_stat_activity table. I have a version that kind of works here: http://pgfoundry.org/frs/download.php/1468/ptop-3.6.1-pre6.tar.gz I've tried it on FreeBSD and Linux, not sure about other platforms though. So it looks a lot like top and can currently do a few simple things like display the current_query from pg_stat_activity for a given process, show the locks held by a process and on which tables, and show the query plan for the current query. It is a ways from polished (not really documented, etc.) but I wanted to see what people thought of a text/curses sort of monitoring tool like this. Maybe something to distribute with PostgreSQL? :) Forgive me if I didn't try out pgtop (the CPAN module.) Regards, Mark -- NAGAYASU Satoshi [EMAIL PROTECTED] Phone: +81-50-5546-2496