Re: [HACKERS] Inaccurate results from numeric ln(), log(), exp() and pow()
On 12 November 2015 at 21:01, Tom Lane wrote:

> Dean Rasheed writes:
>> These results are based on the attached, updated patch which includes
>> a few minor improvements.
>
> I started to look at this patch, and was immediately bemused by the
> comment in estimate_ln_weight:
>
> /*
>  * 0.9 <= var <= 1.1
>  *
>  * Its logarithm has a negative weight (possibly very large). Estimate
>  * it using ln(var) = ln(1+x) = x + O(x^2) ~= x.
>  */
>
> This comment's nonsense of course: ln(0.9) is about -0.105, and ln(1.1) is
> about 0.095, which is not even negative much less "very large".

That's nonsense. The comment is perfectly correct. It's not saying the logarithm is negative, it's saying that the *weight* of the logarithm is negative. For the examples you cite, the weight is small and negative (around -1), but for inputs closer to 1 it can be large and negative.

> We could just replace that with "For values reasonably close to 1, we can
> estimate the result using ln(var) = ln(1+x) ~= x." I am wondering what's
> the point though: why not flush this entire branch in favor of always
> using the generic case? I rather doubt that on any modern machine two
> uses of cmp_var plus init_var/sub_var/free_var are going to be cheaper
> than the log() call you save. You would need a different way to
> special-case 1.0, but that's not expensive.

No, you're missing the point entirely. Suppose var = 1 + 1e-21, in which case ln(var) is approximately 1e-21. The code in the first branch uses the approximation ln(var) = ln(1+x) ~= x = var-1 to see that the weight of ln(var) is around -21. You can't do that with the code in the second branch just by looking at the first couple of digits of var and doing a floating-point approximation, because all you'd see would be 1.000 and your approximation to the logarithm would be orders of magnitude out (as is the case with the code that this function is replacing, which comes with no explanatory comments at all).
> Larger questions are
>
> (1) why the Abs() in the specification of estimate_ln_weight --- that
> doesn't comport with the text about "Estimate the weight of the most
> significant digit".

Yes it does, and the formula is correct. The function returns an estimate of the weight of the logarithm of var. Let L = ln(var); then what we are trying to return is the weight of L. So we don't care about the sign of L, we just need to know its weight -- i.e., we want approximately log10(Abs(L)), which is log10(Abs(ln(var))) as the comment says. The Abs() is necessary because L might be negative (when 0 < var < 1).

> (The branch I'm proposing you remove fails to honor that anyway.)

No it doesn't. It honours the Abs() by ignoring the sign of x, and just looking at its weight.

> (2) what should the behavior be for input == 1, and why? The code
> is returning zero, but that seems inconsistent or at least
> underdocumented.

ln(1) is 0, which has a weight of 0. I suppose you could argue that technically 0 could have any weight, but looking at the bigger picture, what this function is for is deciding how many digits of precision are needed in intermediate calculations to retain full precision in the final result. When the input is exactly 1, the callers will have nice exact results, and no extra intermediate precision is needed, so returning 0 is quite sensible. I guess adding words to that effect to the comment makes sense, since it clearly wasn't obvious.

> (3) if it's throwing an error for zero input, why not for negative
> input? I'm not sure that either behavior is within the purview of
> this function, anyway. Seems like returning zero might be fine.
>
> regards, tom lane

[Shrug] It doesn't make much difference, since both those error cases will be caught a little later on in the callers, but since this function needs to test for an empty digit array in the second branch anyway, it seemed like it might as well report that error there. Returning zero would be fine too.
Regards, Dean -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parallel Seq Scan
On 13 November 2015 at 15:22, Amit Kapila wrote:

> On Fri, Nov 13, 2015 at 7:59 PM, Thom Brown wrote:
>> On 13 November 2015 at 13:38, Amit Kapila wrote:
>> > On Wed, Nov 11, 2015 at 11:40 PM, Pavel Stehule wrote:
>> >>
>> >> yes - another slightly unclean thing in EXPLAIN is the number of
>> >> workers. If I understand the behaviour, the query is processed by
>> >> two processes if workers in the explain is one.
>> >
>> > You are right and I think that is the current working model of the
>> > Gather node, which seems okay. I think the more serious thing here
>> > is that there is a possibility that Explain Analyze can show the
>> > number of workers as more than the actual workers working for the
>> > Gather node. We have already discussed that Explain Analyze should
>> > show the actual number of workers used in query execution; the patch
>> > for the same is still pending.
>>
>> This may have already been discussed before, but in a verbose output,
>> would it be possible to see the nodes for each worker?
>
> There will be hardly any difference in nodes for each worker and it could
> be a very long plan for a large number of workers. What kind of additional
> information do you want which can't be shown in the current format?

For explain plans, not that useful, but it's useful to see how long each worker took for explain analyse. And I imagine as more functionality is added to scan partitions and foreign scans, it will perhaps be more useful when the plans won't be identical. (or would they?)

>> And perhaps associated PIDs?
>
> Yeah, that can be useful; if others also feel like it is important, I can
> look into preparing a patch for the same.

Thanks.

Thom
Re: [HACKERS] [PATCH] Refactoring of LWLock tranches
On Thu, 12 Nov 2015 14:59:59 -0500 Robert Haas wrote:

> On Wed, Nov 11, 2015 at 6:50 AM, Ildus Kurbangaliev wrote:
> > Attached a new version of the patch that moves SLRU tranches and
> > LWLocks to SLRU control structs.
> >
> > The `buffer_locks` field now contains the LWLocks themselves, so we
> > save some memory here. `pstrdup` removed in SimpleLruInit. I didn't
> > change names from the previous patch yet, but I don't mind if they'll
> > be changed.
>
> I've committed this with some modifications. I did a little bit of
> renaming, and I stored a copy of the SLRU name in shared memory.
> Otherwise, we're relying on the fact that a static string will be at
> the same backend-private address in every process, which is a
> dangerous assumption at best.

Thanks! That's very inspiring. I have some questions about the next steps for the other tranches. First of all, I think we can have two API calls, something like:

1) LWLockRequestTranche(char *tranche_name, int locks_count)
2) LWLockGetTranche(char *tranche_name)

LWLockRequestTranche reserves an item in the main tranches array in shared memory, and allocates space for the name, the LWLockTranche and the LWLocks. Here is the first question: is it ok if this call comes from the *ShmemSize functions? We keep those requests, calculate the required size in LWLockShmemSize (by moving this call to the end) and create all these tranches in CreateLWLocks.

And the second issue. We have tranches created manually by backends (xlog, now SLRU), which are located in their own shared memory. For them I suggest just reserving an item in the tranches array with LWLockRequestTranche(NULL, 0). For tranches that are created after Postgres initialization, I suggest not keeping them in shared memory, and just using a local array as in the current implementation. Their name will not be read; we can determine by id that they are not in shared memory, and show "other" in pg_stat_activity and other places.
- Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Re: [HACKERS] Parallel Seq Scan
On Fri, Nov 13, 2015 at 7:59 PM, Thom Brown wrote:
> On 13 November 2015 at 13:38, Amit Kapila wrote:
> > On Wed, Nov 11, 2015 at 11:40 PM, Pavel Stehule wrote:
> >>
> >> yes - another slightly unclean thing in EXPLAIN is the number of
> >> workers. If I understand the behaviour, the query is processed by two
> >> processes if workers in the explain is one.
> >
> > You are right and I think that is the current working model of the
> > Gather node, which seems okay. I think the more serious thing here
> > is that there is a possibility that Explain Analyze can show the
> > number of workers as more than the actual workers working for the
> > Gather node. We have already discussed that Explain Analyze should
> > show the actual number of workers used in query execution; the patch
> > for the same is still pending.
>
> This may have already been discussed before, but in a verbose output,
> would it be possible to see the nodes for each worker?

There will be hardly any difference in nodes for each worker and it could be a very long plan for a large number of workers. What kind of additional information do you want which can't be shown in the current format?

> And perhaps associated PIDs?

Yeah, that can be useful; if others also feel like it is important, I can look into preparing a patch for the same.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Parallel Seq Scan
On Wed, Nov 11, 2015 at 6:53 AM, Robert Haas wrote:
>
> I've committed most of this, except for some planner bits that I
> didn't like, and after a bunch of cleanup. Instead, I committed the
> consider-parallel-v2.patch with some additional planner bits to make
> up for the ones I removed from your patch. So, now we have parallel
> sequential scan!

Pretty cool. All I had to do is mark my slow plperl functions as being parallel safe, and bang, parallel execution of them for seq scans.

But, there does seem to be a memory leak. The setup (warning: 20GB of data):

create table foobar as
  select md5(floor(random()*150)::text) as id,
         random() as volume
  from generate_series(1,2);

set max_parallel_degree TO 8;

explain select count(*) from foobar where volume > 0.9;

                                    QUERY PLAN
---
 Aggregate  (cost=2626202.44..2626202.45 rows=1 width=0)
   ->  Gather  (cost=1000.00..2576381.76 rows=19928272 width=0)
         Number of Workers: 7
         ->  Parallel Seq Scan on foobar  (cost=0.00..582554.56 rows=19928272 width=0)
               Filter: (volume > '0.9'::double precision)

Now running this query leads to an OOM condition:

explain (analyze, buffers) select count(*) from foobar where volume > 0.9;

WARNING:  terminating connection because of crash of another server process

Running it without the explain also causes the problem. A memory dump at some point before the crash looks like:

TopMemoryContext: 62496 total in 9 blocks; 16976 free (60 chunks); 45520 used
  TopTransactionContext: 8192 total in 1 blocks; 4024 free (8 chunks); 4168 used
  ExecutorState: 1795153920 total in 223 blocks; 4159872 free (880 chunks); 1790994048 used
    ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
  Operator class cache: 8192 total in 1 blocks; 1680 free (0 chunks); 6512 used
  other insignificant stuff...
I don't have enough RAM for each of 7 workers to use all that much more than 2GB. work_mem is 25MB, maintenance_work_mem is 64MB.

Cheers,

Jeff
[HACKERS] Foreign Data Wrapper
Writing a Foreign Data Wrapper and interested in isolating the WHERE clause to speed up the access of an indexed file on my filesystem. I'm attempting to understand the inner workings of how the data is retrieved, so I'm writing code to just handle one case at the moment: a WHERE clause on a single column in the foreign 'table'.

SELECT * FROM t WHERE testval = 1

I have this code so far, an implementation of the IterateForeignScan interface:

static TupleTableSlot *
bIterateForeignScan(ForeignScanState *node)
{
    ...
    RestrictInfo *rinfo = (RestrictInfo *) node->ss.ps.qual;
    ...
}

yet I am not familiar with what I need to do to pick apart RestrictInfo in order to gather 'testval', '=', and '1' separately so I can interpret and pass those through to my file parser.

Am I going about this the correct way, or is there another path I should follow?
Re: [HACKERS] Freeze avoidance of very large table.
On 2015-10-31 11:02:12 +0530, Amit Kapila wrote:
> On Thu, Oct 8, 2015 at 11:05 PM, Simon Riggs wrote:
> > On 1 October 2015 at 23:30, Josh Berkus wrote:
> >> On 10/01/2015 07:43 AM, Robert Haas wrote:
> >> > On Thu, Oct 1, 2015 at 9:44 AM, Fujii Masao wrote:
> >> >> I wonder how much it's worth renaming only the file extension while
> >> >> there are many places where "visibility map" and "vm" are used,
> >> >> for example, log messages, function names, variables, etc.
> >> >
> >> > I'd be inclined to keep calling it the visibility map (vm) even if it
> >> > also contains freeze information.
>
> What is your main worry about changing the name of this map: is it
> about more code churn, or that we might introduce new issues, or that
> people are already accustomed to calling this map the visibility map?

Several:

* Visibility map is rather descriptive; none of the replacement terms imo come close. Few people will know what a 'freeze' map is.
* It increases the size of the patch considerably.
* It forces tooling that knows about the layout of the database directory to change.

On the benefit side, the only argument I've heard so far is that it allows one to disambiguate the format. But, uh, a look at the major version does that just as well, for far less trouble.

> It seems to me quite logical for understanding purposes as well. Any new
> person who wants to work in this area or is looking into it will always
> wonder why this map is named the visibility map even though it contains
> information about the visibility of the page as well as the frozen state
> of the page.

Being frozen is about visibility as well.

Greetings,

Andres Freund
Re: [HACKERS] Foreign Data Wrapper
On Fri, Nov 13, 2015 at 1:46 PM, Big Mike wrote:

> Writing a Foreign Data Wrapper and interested in isolating the WHERE
> clause to speed up the access of an indexed file on my filesystem. I'm
> attempting to understand the inner workings of how the data is retrieved,
> so I'm writing code to just handle one case at the moment: a WHERE clause
> on a single column in the foreign 'table'.
>
> SELECT * FROM t WHERE testval = 1
>
> I have this code so far, an implementation of the IterateForeignScan
> interface:
>
> static TupleTableSlot *
> bIterateForeignScan(ForeignScanState *node) {
>     ...
>     RestrictInfo *rinfo = (RestrictInfo *) node->ss.ps.qual;
>     ...
> }
>
> yet I am not familiar with what I need to do to pick apart RestrictInfo
> in order to gather 'testval', '=', and '1' separately so I can interpret
> and pass those through to my file parser.
>
> Am I going about this the correct way, or is there another path I should
> follow?

I would look at http://multicorn.org/ which gives you a working python framework. You subclass their ForeignDataWrapper class, override the __init__() and execute() functions, and that's about it. The execute() function has a list called quals that would set you up for the filtering you want to do. I would get the foreign data wrapper fully debugged this way before attempting to refactor in C. And when you do finally re-code, you can see what multicorn itself did to implement your filters.
Re: [HACKERS] Inaccurate results from numeric ln(), log(), exp() and pow()
Dean Rasheed writes:
> On 12 November 2015 at 21:01, Tom Lane wrote:
>> I started to look at this patch, and was immediately bemused by the
>> comment in estimate_ln_weight:
>
> That's nonsense. The comment is perfectly correct. It's not saying the
> logarithm is negative, it's saying that the *weight* of the logarithm
> is negative.

Ah, you're right --- I'd gotten confused about the distinction between ln(x) and ln(ln(x)). Nevermind ...

Next question: in the additional range-reduction step you added to ln_var, why stop there, ie, what's the rationale for this magic number:

    if (Abs((x.weight + 1) * DEC_DIGITS) > 10)

Seems like we arguably should do this whenever the weight isn't zero, so as to minimize the number of sqrt() steps. (Yes, I see the point about not getting into infinite recursion, but that only says that the threshold needs to be more than 10, not that it needs to be 10^10.)

Also, it seems a little odd to put the recursive calculation of ln(10) where you did, rather than down where it's used, ie why not

    mul_var(result, , result, local_rscale);

    ln_var(_ten, _10, local_rscale);
    int64_to_numericvar((int64) pow_10, );
    mul_var(_10, , , local_rscale);
    add_var(result, , result);

    round_var(result, rscale);

As you have it, ln_10 will be calculated with possibly a smaller rscale than is used in this stanza. That might be all right but it seems dubious --- couldn't the lower-precision result leak into digits we care about?

regards, tom lane
Re: [HACKERS] Freeze avoidance of very large table.
On Tue, Nov 3, 2015 at 09:03:49AM +0530, Amit Kapila wrote:
> I think in that case we can call it a page info map or page state map,
> but I find retaining the visibility map name in this case, or for the
> future (if we want to add another bit), confusing. In fact, if you find
> "visibility and freeze map" excessively long, then we can change it to
> "page info map" or "page state map" now as well.

Coming in late here, but the problem with "page info map" is that free space is also page info (how much free space on each page), so "page info map" isn't very descriptive. "page status" or "page state" might make more sense, but even then, free space is a kind of page status/state. What is happening is that broadening the name to cover both visibility and freeze state also encompasses free space.

--
  Bruce Momjian  http://momjian.us
  EnterpriseDB   http://enterprisedb.com

  + As you are, so once was I. As I am, so you will be. +
  + Roman grave inscription +
Re: [HACKERS] Inaccurate results from numeric ln(), log(), exp() and pow()
On 13 November 2015 at 18:36, Tom Lane wrote:
> Next question: in the additional range-reduction step you added to
> ln_var, why stop there, ie, what's the rationale for this magic number:
>
> if (Abs((x.weight + 1) * DEC_DIGITS) > 10)
>
> Seems like we arguably should do this whenever the weight isn't zero,
> so as to minimize the number of sqrt() steps. (Yes, I see the point
> about not getting into infinite recursion, but that only says that
> the threshold needs to be more than 10, not that it needs to be 10^10.)

It's a bit arbitrary. There is a tradeoff here -- computing ln(10) is more expensive than doing a sqrt(), since the Babylonian algorithm used for sqrt() is quadratically convergent, whereas the Taylor series for ln() converges more slowly. At the default precision, ln(10) is around 7 times slower than sqrt() on my machine, although that will vary with precision, and the sqrt()s increase the local rscale and so they will get slower. Anyway, it seemed reasonable to not do the extra ln() unless it was going to save at least a couple of sqrt()s.

> Also, it seems a little odd to put the recursive calculation of ln(10)
> where you did, rather than down where it's used, ie why not
>
>     mul_var(result, , result, local_rscale);
>
>     ln_var(_ten, _10, local_rscale);
>     int64_to_numericvar((int64) pow_10, );
>     mul_var(_10, , , local_rscale);
>     add_var(result, , result);
>
>     round_var(result, rscale);
>
> As you have it, ln_10 will be calculated with possibly a smaller rscale
> than is used in this stanza. That might be all right but it seems dubious
> --- couldn't the lower-precision result leak into digits we care about?

Well, it still has 8 digits more than the final rscale, so it's unlikely to cause any loss of precision when added to the result and then rounded to rscale. But on the other hand it does look neater to compute it there, right where it's needed.
Regards, Dean
Re: [HACKERS] Inaccurate results from numeric ln(), log(), exp() and pow()
Dean Rasheed writes:
> On 13 November 2015 at 18:36, Tom Lane wrote:
>> Seems like we arguably should do this whenever the weight isn't zero,
>> so as to minimize the number of sqrt() steps.
>
> It's a bit arbitrary. There is a tradeoff here -- computing ln(10) is
> more expensive than doing a sqrt() since the Babylonian algorithm used
> for sqrt() is quadratically convergent, whereas the Taylor series for
> ln() converges more slowly. At the default precision, ln(10) is around
> 7 times slower than sqrt() on my machine, although that will vary with
> precision, and the sqrt()s increase the local rscale and so they will
> get slower. Anyway, it seemed reasonable to not do the extra ln()
> unless it was going to save at least a couple of sqrt()s.

OK --- I think I miscounted how many sqrt's we could expect to save.

One more thing: the approach you used in power_var() of doing a whole separate exp * ln(base) calculation to approximate the result weight seems mighty expensive, even if it is done at minimal precision. Couldn't we get a good-enough approximation using basically

    numericvar_to_double_no_overflow(exp) * estimate_ln_weight(base)

?

regards, tom lane
Re: [HACKERS] proposal: multiple psql option -c
On Fri, Nov 13, 2015 at 1:54 PM, Catalin Iacob wrote:
> So I promised I'd try to document this. I had a look at the proposed
> semantics of -C and I think in the patch they're too complicated, which
> makes explaining them hard.
>
> My assumptions about behaviour without this patch, from reading the
> docs and some experimenting, correct me if I'm wrong:
>
> 1. psql normally splits its input by ; let's call each piece of the
> split a statement
>
> 2. for every statement resulting after 1, if it's a \ command it's
> interpreted internally, else a query with it is sent to the server and
> the result is displayed
>
> 3. 1. and 2. happen when the input comes from a file (-f) or from stdin
>
> 4. autocommit off changes behaviour in that it sends a BEGIN before
> any of the statements after the split in 1 (except for \ commands,
> BEGIN or things like VACUUM which don't work within transactions)
>
> 5. --single-transaction changes behaviour in that it puts a BEGIN
> before the whole input (not around each statement) and a COMMIT after
>
> 6. all of the above DON'T apply for -c, which does very different
> things: it doesn't split and instead it sends everything, in one query,
> to the backend. The backend can execute such a thing (it splits itself
> by ;) except in some cases like SELECT + VACUUM. Since the single query
> is effectively a single transaction for the backend, -c ignores
> --single-transaction and autocommit off. Even more, when executing
> such a multiple statement the backend only returns results for the
> last statement of the query.
>
> From the above it seems -c is a different thing altogether, while the
> other behaviour allows one input with multiple commands and multiple
> results, and works the same on stdin and a file.
>
> So my proposal is: allow a *single* argument for -C and treat its
> content *exactly* like the input from stdin or from a file.
> This answers all the questions about interactions with
> --single-transaction and autocommit naturally: it behaves exactly like
> stdin and -f behave today. And having a single parameter is similar to
> having a single file or single stdin. Having multiple -C is also
> confusing, since it seems the statements in one -C are grouped somehow
> and the ones in the next -C are another group, so this starts feeling
> like there's maybe a transaction per -C group etc.
>
> Am I missing something or is it that simple?

While not in patch form, here is some food for thought.

Tweaks to -c to link it with -C:

6c6
< Specifies that psql is to execute one
---
> Specifies that psql is to execute the
12d11
<
32a32,36
> Furthermore, only a single instance of this parameter is accepted.
> Attempting to provide multiple instances will result in the entire
> shell command failing.
>
34,35c38,41
> the -c string often has unexpected results. Two better options are
> available to execute multiple commands in a controlled manner. You may
> use the -C option, described next, or choose to feed multiple commands
> to psql's

Draft -C thoughts:

-C command(s)
--multi-command=command(s)

Specifies that psql is to execute one or more command strings, commands, and then exit. This differs from -c in that multiple instances may be present on the same shell command. Also unlike -c, individual -C commands and statements are executed in auto-commit mode. The following pseudo-code example describes the script that is effectively created.

psql -C 'SELECT 1;SELECT 2' -C 'SELECT 3;SELECT 4'

psql <<EOF
BEGIN; SELECT 1; COMMIT;
BEGIN; SELECT 2; COMMIT;
BEGIN; SELECT 3; COMMIT;
BEGIN; SELECT 4; COMMIT;
EOF

Alternatively, the option --single-transaction makes the entire multi-command execute within a single transaction. There is no option to have entire -C commands commit independently of each other; you have to issue separate psql shell commands.
Output from the -C command behaves more script-like than -c, as each statement within each command is output. As with -c, the start-up files (psqlrc and ~/.psqlrc) are ignored if this option is present on the command line.

One particular motivation for introducing -C is that the first command below fails if executed using -c, but now there are two equivalent command lines that work:

psql -Atq -C "VACUUM FULL foo; SELECT pg_relation_size('foo')"
psql -Atq -C "VACUUM FULL foo" -C "SELECT pg_relation_size('foo')"
Re: [HACKERS] Inaccurate results from numeric ln(), log(), exp() and pow()
On 13 November 2015 at 21:00, Tom Lane wrote:
> BTW, something I find confusing and error-prone is that this patch keeps
> on using the term "weight" to refer to numbers expressed in decimal
> digits (ie, the approximate log10 of something). Basically everywhere in
> the existing code, "weights" are measured in base-NBASE digits, while
> "scales" are measured in decimal digits. I've not yet come across
> anyplace where you got the units wrong, but it seems like a gotcha
> waiting to bite the next hacker.
>
> I thought for a bit about s/weight/scale/g in the patch, but that is not
> le mot juste either, since weight is generally counting digits to the
> left of the decimal point while scale is generally counting digits to
> the right.
>
> The best idea that has come to me is to use "dweight" to refer to a
> weight measured in decimal digits. Anyone have a better thought?

Yeah, dweight works for me.

Regards,
Dean
Re: [HACKERS] check for interrupts in set_rtable_names
Jeff Janes writes:
> Someone sent my server a deranged query, it tripped my
> auto_explain.log_min_duration setting, that hit some kind of
> pathological case while assigning aliases, and now it sits
> uninterruptibly in set_rtable_names for hours.
>
> Is there any reason we can't check for interrupts in set_rtable_names,
> like the attached?

There's probably no reason not to do that, but I'd be much more interested in eliminating the slowness to begin with ...

regards, tom lane
Re: [HACKERS] Inaccurate results from numeric ln(), log(), exp() and pow()
BTW, something I find confusing and error-prone is that this patch keeps on using the term "weight" to refer to numbers expressed in decimal digits (ie, the approximate log10 of something). Basically everywhere in the existing code, "weights" are measured in base-NBASE digits, while "scales" are measured in decimal digits. I've not yet come across anyplace where you got the units wrong, but it seems like a gotcha waiting to bite the next hacker. I thought for a bit about s/weight/scale/g in the patch, but that is not le mot juste either, since weight is generally counting digits to the left of the decimal point while scale is generally counting digits to the right. The best idea that has come to me is to use "dweight" to refer to a weight measured in decimal digits. Anyone have a better thought? regards, tom lane
Re: [HACKERS] proposal: multiple psql option -c
So I promised I'd try to document this. I had a look at the proposed semantics of -C and I think in the patch they're too complicated, which makes explaining them hard.

My assumptions about behaviour without this patch, from reading the docs and some experimenting, correct me if I'm wrong:

1. psql normally splits its input by ; let's call each piece of the split a statement

2. for every statement resulting after 1, if it's a \ command it's interpreted internally, else a query with it is sent to the server and the result is displayed

3. 1. and 2. happen when the input comes from a file (-f) or from stdin

4. autocommit off changes behaviour in that it sends a BEGIN before any of the statements after the split in 1 (except for \ commands, BEGIN or things like VACUUM which don't work within transactions)

5. --single-transaction changes behaviour in that it puts a BEGIN before the whole input (not around each statement) and a COMMIT after

6. all of the above DON'T apply for -c, which does very different things: it doesn't split and instead it sends everything, in one query, to the backend. The backend can execute such a thing (it splits itself by ;) except in some cases like SELECT + VACUUM. Since the single query is effectively a single transaction for the backend, -c ignores --single-transaction and autocommit off. Even more, when executing such a multiple statement the backend only returns results for the last statement of the query.

From the above it seems -c is a different thing altogether, while the other behaviour allows one input with multiple commands and multiple results, and works the same on stdin and a file.

So my proposal is: allow a *single* argument for -C and treat its content *exactly* like the input from stdin or from a file. This answers all the questions about interactions with --single-transaction and autocommit naturally: it behaves exactly like stdin and -f behave today. And having a single parameter is similar to having a single file or single stdin.
Having multiple -C is also confusing, since it seems the statements in one -C are grouped somehow and the ones in the next -C are another group, so this starts feeling like there's maybe a transaction per -C group etc.

Am I missing something or is it that simple?
Re: [HACKERS] [DESIGN] ParallelAppend
On Thu, Nov 12, 2015 at 12:09 AM, Kouhei Kaigai wrote:
> I'm now designing the parallel feature of Append...
>
> Here is one challenge. How do we determine whether each sub-plan
> allows execution in the background worker context?

I've been thinking about these questions for a bit now, and I think we can work on improving Append in multiple phases. The attached patch shows what I have in mind for phase 1. Currently, if you set up an inheritance hierarchy, you might get an Append some of whose children are Gather nodes with Parallel Seq Scans under them, like this:

Append
-> Gather
   -> Parallel Seq Scan
-> Gather
   -> Parallel Seq Scan
-> Seq Scan

This is a crappy plan. Each Gather node will try to launch its own bunch of workers, which sucks. The attached patch lets you get this plan instead:

Gather
-> Append
   -> Partial Seq Scan
   -> Partial Seq Scan
   -> Partial Seq Scan

That's a much better plan. To make this work, this patch introduces a couple of new concepts. Each RelOptInfo gets a new partial_pathlist field, which stores paths that need to be gathered to produce a workable plan. Once we've populated the partial_pathlist with whatever partial paths we know how to generate, we can either push a Gather node on top of one of those partial paths to create a real path, or we can use those partial paths to build more partial paths. The current patch handles only the simplest case: we can build a partial path for an appendrel by appending a partial path for each member rel. But eventually I hope to handle joinrels this way as well: you can join a partial path with an ordinary path to form a partial path for the joinrel. This requires some way of figuring out how many workers to request for the append-path, so this patch also adds a parallel_degree field to the path object, which is 0 for non-parallel things and otherwise the number of workers that the path believes to be ideal.
Right now, it just percolates the highest worker count of any child up to the appendrel, which might not be right, especially once the append node knows how to balance workers, but it seems like a reasonable place to start.

> Type-A) It can be performed on background worker context and
>         picked up by multiple worker processes concurrently.
>         (e.g: Parallel SeqScan)
> Type-B) It can be performed on background worker context but
>         cannot be picked up by multiple worker processes.
>         (e.g: non-parallel aware node)
> Type-C) It should not be performed on background worker context.
>         (e.g: plan/path node with volatile functions)

At the time that we're forming an AppendPath, we can identify what you're calling type-A paths very easily: they're the ones that are coming from the partial_pathlist. We can distinguish between type-B and type-C paths coming from the childrel's pathlist based on the childrel's consider_parallel flag: if it's false, it's type-C, else type-B. At some point, we might need a per-path flag to distinguish cases where a particular path is type-C even though some other plan for that relation might be type-B. But right now that case doesn't arise. Now, of course, it's not good enough to have this information available when we're generating the AppendPath; it has to survive until execution time. But that doesn't mean the paths need to be self-identifying. We could, of course, decide to make them so, and maybe that's the best design, but I'm thinking about another approach: suppose the append node itself, instead of having one list of child plans, has a list of type-A plans, a list of type-B plans, and a list of type-C plans.
This actually seems quite convenient, because at execution time, you presumably want the leader to prioritize type-C plans over any of the others, since it's the only one that can execute them, and the workers to prioritize type-B plans, since they can only take one worker each and are thus prone to be the limiting factor on finishing the whole Append. Having the plans divided in advance between these three lists (say, restricted_plans, safe_plans, parallel_plans) makes that easy to implement. Incidentally, I think it's subtly wrong to think of the parallel_aware flag as telling you whether the plan can absorb multiple workers. That's not really what it's for. It's to tell you whether the plan is doing *something* parallel aware - that is, whether its Estimate, InitializeDSM, and InitializeWorker callbacks should do anything. For SeqScan, flipping parallel_aware actually does split the input among all the workers, but for Append it probably just load-balances, and for other nodes it might be something else again. The term I'm using to indicate a path/plan that returns only a subset of the results in each worker is "partial". Whether or not a path is partial is, in the design embodied in this patch, indicated both by whether path->parallel_degree > 0 and whether the path is in rel->pathlist or rel->partial_pathlist. -- Robert Haas
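The classification rule and the parallel_degree percolation described above can be sketched as follows. This is illustrative Python only; the dict fields stand in for the real RelOptInfo/Path fields and are not the actual PostgreSQL structures:

```python
def classify_children(children):
    """Split an Append's children into parallel (type-A), safe (type-B)
    and restricted (type-C) lists, and pick a parallel_degree for the
    appendrel as the highest worker count of any partial child."""
    parallel_plans, safe_plans, restricted_plans = [], [], []
    degree = 0
    for child in children:
        if child["is_partial"]:
            # came from the rel's partial_pathlist: type-A
            parallel_plans.append(child)
            degree = max(degree, child["parallel_degree"])
        elif child["consider_parallel"]:
            # parallel-safe but not partial: type-B
            safe_plans.append(child)
        else:
            # must run in the leader: type-C
            restricted_plans.append(child)
    return parallel_plans, safe_plans, restricted_plans, degree
```

Having the three lists built in advance is what makes the leader/worker prioritization described above cheap at execution time.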
[HACKERS] check for interrupts in set_rtable_names
Someone sent my server a deranged query; it tripped my auto_explain.log_min_duration setting, which hit some kind of pathological case while assigning aliases, and now it sits uninterruptibly in set_rtable_names for hours. Is there any reason we can't check for interrupts in set_rtable_names, like the attached? I've tried it on a deranged query and it seems to do the job. A deranged query: perl -le 'print "explain"; print "select * from pgbench_accounts where aid=5 union" foreach 1..5000; print "select * from pgbench_accounts where aid=5 ;"'|psql Cheers, Jeff rtable_names_interrupts.patch Description: Binary data
Re: [HACKERS] Parallel Seq Scan
On Fri, Nov 13, 2015 at 9:16 PM, Thom Brown wrote:
>
> On 13 November 2015 at 15:22, Amit Kapila wrote:
> >
> > There will be hardly any difference in nodes for each worker and it could
> > be a very long plan for a large number of workers. What kind of additional
> > information do you want which can't be shown in the current format?
>
> For explain plans, not that useful, but it's useful to see how long
> each worker took for explain analyse.

The statistics related to buffers, timing, and in fact rows filtered will be different for each worker, so it sounds sensible to me to display separate plan info for each worker, or at least display the same in verbose or some other mode and then display aggregated information at the Gather node. The only point that needs more thought is that parallel plans will look big for a large number of workers. I think this will need somewhat more substantial changes than what is done currently for parallel seq scan, so it is better if others also share their opinion about this form of display of information for parallel queries. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] check for interrupts in set_rtable_names
On Fri, Nov 13, 2015 at 3:13 PM, Tom Lane wrote:
> Jeff Janes writes:
>> Someone sent my server a deranged query, it tripped my
>> auto_explain.log_min_duration setting, that hit some kind of
>> pathological case while assigning aliases, and now it sits
>> uninterruptibly in set_rtable_names for hours.
>
>> Is there any reason we can't check for interrupts in set_rtable_names,
>> like the attached?
>
> There's probably no reason not to do that, but I'd be much more interested
> in eliminating the slowness to begin with ...

I was thinking about that as well, but I don't think that would be back-patchable, at least not the way I was envisioning it. I was thinking of detecting bad cases (had to count to over 10 before finding a novel name, more than 10 times) and then switching from an object-local count to a global count for the numbers to add to the name. But that would be a behavior change under some conditions. I think the code says it more clearly than prose, so a crude first attempt at that is attached. Cheers, Jeff rtable_names_faster.patch Description: Binary data
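The shape of the slowness and of the proposed speedup can be sketched like this: trying name_1, name_2, ... from scratch for every duplicate relation name is O(n^2) when thousands of rtable entries share one base name, while remembering the last number used per base name makes each assignment O(1) amortized. This is an illustrative Python simplification; Jeff's actual patch only switches strategy after detecting the bad case, to limit the behavior change:

```python
def assign_unique_names(base_names):
    """Assign each entry a name not used before, suffixing _N on clashes.
    Caching the last suffix per base name avoids rescanning from 1."""
    used = set()
    next_suffix = {}  # base name -> last suffix number tried
    result = []
    for base in base_names:
        name = base
        n = next_suffix.get(base, 0)
        while name in used:
            n += 1
            name = "%s_%d" % (base, n)
        next_suffix[base] = n
        used.add(name)
        result.append(name)
    return result
```

With 5000 identical base names (as in the deranged UNION query upthread), the cached version does one set probe per entry instead of thousands.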
Re: [HACKERS] Inaccurate results from numeric ln(), log(), exp() and pow()
On 13 November 2015 at 23:10, Tom Lane wrote:
> One more thing: the approach you used in power_var() of doing a whole
> separate exp * ln(base) calculation to approximate the result weight
> seems mighty expensive, even if it is done at minimal precision.
> Couldn't we get a good-enough approximation using basically
> numericvar_to_double_no_overflow(exp) * estimate_ln_weight(base) ?

I can't see a way to make that work reliably. It would need to be 10^estimate_ln_weight(base), and the problem is that both exp and ln_weight could be too big to fit in double variables, and become HUGE_VAL, losing all precision. An interesting example is the limit of (1+1/x)^x as x approaches infinity, which is e (the base of natural logarithms), so in that case both the exponent and ln_weight could be arbitrarily big (well, too big for doubles anyway). For example (1+1/1.2e+500)^(1.2e500) = 2.7182818284... Regards, Dean
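The failure mode is easy to demonstrate, since Python floats are C doubles (illustrative sketch only):

```python
from decimal import Decimal

# The exponent from the example above fits fine in arbitrary precision...
exponent = Decimal("1.2e500")

# ...but converting it to a double overflows past DBL_MAX (~1.8e308)
# to infinity, at which point all information is lost -- the numeric
# equivalent of becoming HUGE_VAL.
as_double = float(exponent)

# Yet the final result (1 + 1/1.2e500)^(1.2e500) ~= e = 2.7182818284...
# is perfectly ordinary, so the weight estimate has to survive these
# huge intermediate values rather than round-trip them through doubles.
```

This is why the estimate has to stay in the arbitrary-precision representation instead of a quick double-based shortcut.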
Re: [HACKERS] Inaccurate results from numeric ln(), log(), exp() and pow()
On 14 November 2015 at 00:16, Dean Rasheed wrote:
> I can't see a way to make that work reliably. It would need to be
> 10^estimate_ln_weight(base) and the problem is that both exp and
> ln_weight could be too big to fit in double variables, and become
> HUGE_VAL, losing all precision.

Of course I meant 10^ln_weight could be too big to fit in a double. Regards, Dean
Re: [HACKERS] Inaccurate results from numeric ln(), log(), exp() and pow()
Dean Rasheed writes:
> On 14 November 2015 at 00:16, Dean Rasheed wrote:
>> I can't see a way to make that work reliably. It would need to be
>> 10^estimate_ln_weight(base) and the problem is that both exp and
>> ln_weight could be too big to fit in double variables, and become
>> HUGE_VAL, losing all precision.
> Of course I meant 10^ln_weight could be too big to fit in a double.

Drat, that's the second time in this review that I've confused ln_weight(x) with ln(x). Time to call it a day ... regards, tom lane
Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive
On Wed, Sep 16, 2015 at 11:22 PM, Robert Haas wrote:
> On Wed, Sep 16, 2015 at 12:29 PM, Alexander Korotkov wrote:
>
>> I think it's reasonable to consider reporting this data in the PGPROC
>> using a 4-byte integer rather than reporting it through a single byte
>> in the backend status structure. I believe that addresses the
>> concerns about reporting from auxiliary processes, and it also allows
>> a little more data to be reported. For anything in excess of that, I
>> think we should think rather harder. Most likely, such additional
>> detail should be reported only for certain types of wait events, or on
>> a delay, or something like that, so that the core mechanism remains
>> really, really fast.
>
> That sounds reasonable. There are many pending questions, but it seems like
> a step forward to me.
>
> Great, let's do it. I think we should probably do the work to
> separate the non-individual lwlocks into tranches first, though.

One thing that occurred to me in this context is that if we store the wait event information in PGPROC, then can we think of providing the info about wait events in a separate view pg_stat_waits (or pg_stat_wait_info or any other better name) where we can display wait information about all processes rather than only backends? This will avoid the confusion about breaking backward compatibility for the current 'waiting' column in pg_stat_activity. pg_stat_waits can have columns:

pid - process id
wait_class_name - name of the wait class
wait_event_name - name of the wait event

We can extend it later with the information about timing for wait events. Also, if we follow this approach, I think we don't need to store this information in PgBackendStatus. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
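One way the single 4-byte integer in PGPROC could carry both pieces of information is to reserve one byte for the wait class and the remaining bytes for the event within that class. The layout and names below are hypothetical, just to illustrate the packing idea in Python:

```python
def pack_wait_event(class_id, event_id):
    """Pack a wait class (0-255) and an event id (24 bits) into a
    single 32-bit value, as a backend might store in shared memory."""
    assert 0 <= class_id < 256 and 0 <= event_id < 2**24
    return (class_id << 24) | event_id

def unpack_wait_event(packed):
    """Split the 32-bit value back into (class_id, event_id), as a
    monitoring view would when rendering the class and event names."""
    return (packed >> 24) & 0xFF, packed & 0xFFFFFF
```

A single aligned 4-byte store keeps the reporting path cheap, which matches the requirement that the core mechanism stay really fast.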
[HACKERS] pg_rewind and replication slots
Hi. Should pg_rewind ignore the pg_replslot dir at all? As pg_basebackup does, for example. -- May the force be with you… https://simply.name
Re: [HACKERS] Parallel Seq Scan
On 13 November 2015 at 13:38, Amit Kapila wrote:
> On Wed, Nov 11, 2015 at 11:40 PM, Pavel Stehule wrote:
>>
>> yes - the another little bit unclean in EXPLAIN is number of workers. If I
>> understand to the behave, the query is processed by two processes if workers
>> in the explain is one.
>
> You are right and I think that is current working model of Gather
> node which seems okay. I think the more serious thing here
> is that there is possibility that Explain Analyze can show the
> number of workers as more than actual workers working for Gather
> node. We have already discussed that Explain Analyze should show
> the actual number of workers used in query execution; the patch for
> the same is still pending.

This may have already been discussed before, but in verbose output, would it be possible to see the nodes for each worker? e.g.

# explain (analyse, buffers, timing, verbose, costs) select count(*) from js where content->'tags'->>'title' like '%de%';
                              QUERY PLAN
--------------------------------------------------------------------
 Aggregate  (cost=105557.59..105557.60 rows=1 width=0) (actual time=400.752..400.752 rows=1 loops=1)
   Output: count(*)
   Buffers: shared hit=175333
   ->  Gather  (cost=1000.00..104931.04 rows=250621 width=0) (actual time=400.748..400.748 rows=0 loops=1)
         Output: content
         Number of Workers: 2
         Buffers: shared hit=175333
         ->  Parallel Seq Scan on public.js  (cost=0.00..39434.47 rows=125310 width=0) (actual time=182.256..398.14 rows=0 loops=1)
               Output: content
               Filter: (((js.content -> 'tags'::text) ->> 'title'::text) ~~ '%de%'::text)
               Rows Removed by Filter: 626486
               Buffers: shared hit=87666
         ->  Parallel Seq Scan on public.js  (cost=0.00..39434.47 rows=1253101 width=0) (actual time=214.11..325.31 rows=0 loops=1)
               Output: content
               Filter: (((js.content -> 'tags'::text) ->> 'title'::text) ~~ '%de%'::text)
               Rows Removed by Filter: 6264867
               Buffers: shared hit=876667
 Planning time: 0.085 ms
 Execution time: 414.713 ms
(14 rows)

And perhaps associated PIDs?
Thom
Re: [HACKERS] Parallel Seq Scan
On 13 November 2015 at 03:39, Amit Kapila wrote:
> On Thu, Nov 12, 2015 at 9:05 PM, Thom Brown wrote:
>>
>> On 12 November 2015 at 15:23, Amit Kapila wrote:
>> > On Wed, Nov 11, 2015 at 11:29 PM, Pavel Stehule wrote:
>> >>
>> >> Hi
>> >>
>> >> I have a first query
>> >>
>> >> I looked on EXPLAIN ANALYZE output and the numbers of filtered rows are
>> >> different
>> >
>> > Thanks for the report. The reason for this problem is that instrumentation
>> > information from workers is getting aggregated multiple times. In
>> > ExecShutdownGatherWorkers(), we call ExecParallelFinish where it
>> > will wait for workers to finish and then accumulate stats from workers.
>> > Now ExecShutdownGatherWorkers() could be called multiple times
>> > (once we read all tuples from workers, at end of node) and it should be
>> > ensured that repeated calls should not try to redo the work done by first
>> > call. The same is ensured for tuplequeues, but not for parallel executor info.
>> > I think we can safely assume that we need to call ExecParallelFinish() only
>> > when there are workers started by the Gather node, so on those lines the
>> > attached patch should fix the problem.
>>
>> That fixes the count issue for me, although not the number of buffers hit,
>
> The number of shared buffers hit could be different across different runs
> because the read sequence of parallel workers can't be guaranteed, also
> I don't think the same is even guaranteed for the Seq Scan node; the other
> operations in parallel could lead to a different number. However, the actual
> problem was that in one of the plans shown by you [1], the Buffers hit at the
> Gather node (175288) is less than the Buffers hit at the Parallel Seq Scan
> node (241201). Do you still (after applying the above patch) see that the
> Gather node is showing fewer hit buffers than the Parallel Seq Scan node?

Hmm... that's odd, I'm not seeing the problem now, so maybe I'm mistaken there.
> [1]
> # explain (analyse, buffers, timing, verbose, costs) select count(*)
> from js where content->'tags'->>'title' like '%design%';
>                              QUERY PLAN
>
>  Aggregate  (cost=132489.34..132489.35 rows=1 width=0) (actual
> time=382.987..382.987 rows=1 loops=1)
>    Output: count(*)
>    Buffers: shared hit=175288
>    ->  Gather  (cost=1000.00..132488.34 rows=401 width=0) (actual
> time=382.983..382.983 rows=0 loops=1)
>          Output: content
>          Number of Workers: 1
>          Buffers: shared hit=175288
>          ->  Parallel Seq Scan on public.js  (cost=0.00..131448.24
> rows=401 width=0) (actual time=379.407..1141.437 rows=0 loops=1)
>                Output: content
>                Filter: (((js.content -> 'tags'::text) ->>
> 'title'::text) ~~ '%design%'::text)
>                Rows Removed by Filter: 1724810
>                Buffers: shared hit=241201
>  Planning time: 0.104 ms
>  Execution time: 403.045 ms
> (14 rows)
>
> Time: 403.596 ms
>
>> or the actual time taken.
>
> Exactly what time are you referring to here, Execution Time or the actual time
> shown on the Parallel Seq Scan node, and what problem do you see with
> the reported time?

I'm referring to the Parallel Seq Scan actual time, showing "379.407..1141.437" with 1 worker, but the total execution time shows 403.045. If one worker is taking over a second, how come the whole query was less than half a second? Thom
Re: [HACKERS] eXtensible Transaction Manager API
On Thu, Nov 12, 2015 at 11:23 PM, Robert Haas wrote:
> On Mon, Nov 9, 2015 at 2:47 PM, Simon Riggs wrote:
>> On 9 November 2015 at 18:46, Robert Haas wrote:
>>> One point I'd like to mention is that it's absolutely critical to
>>> design this in a way that minimizes network roundtrips without
>>> compromising correctness. XC's GTM proxy suggests that they failed to
>>> do that. I think we really need to look at what's going to be on the
>>> other sides of the proposed APIs and think about whether it's going to
>>> be possible to have a strong local caching layer that keeps network
>>> roundtrips to a minimum. We should consider whether the need for such
>>> a caching layer has any impact on what the APIs should look like.
>> You mean the caching layer that already exists in XL/XC?
>
> I don't think that's what I mean, no. If you have an external GTM
> Proxy, then you have missed a trick, because whatever caching it does
> could be done better inside the process that sent the request to the
> proxy.

Definitely. Having a single communication channel between the backends and the centralized server that communicates with grouped messages would clearly help enhance the scalability of the system, though this is not something the DTM should try to solve IMO.

>>> For example, consider a 10-node cluster where each node has 32 cores
>>> and 32 clients, and each client is running lots of short-running SQL
>>> statements. The demand for snapshots will be intense. If every
>>> backend separately requests a snapshot for every SQL statement from
>>> the coordinator, that's probably going to be terrible. We can make it
>>> the problem of the stuff behind the DTM API to figure out a way to
>>> avoid that, but maybe that's going to result in every DTM needing to
>>> solve the same problems.
>>
>> The whole purpose of that XTM API is to allow different solutions for that
>> to be created.
>> Konstantin has made a very good case for such an API to
>> exist, based around 3 markedly different approaches.
>
> I'm not arguing with that.
>
>> Whether we allow the API into core to be accessible via extensions is a
>> different issue, but it looks fine for its purpose.
>
> I'm not attacking the API. I'm trying to have a discussion about the
> important design issues in this area.

FWIW, I just looked at the wiki page regarding the DTM, and the set of routines is actually very close to what XC/XL's GTM is doing. Where XC/XL directly forked the code of Postgres to redirect to the GTM when fetching a snapshot, TXID, whatever, the DTM is using a set of generic methods to achieve this purpose and replace the in-core calls with a set of custom functions to relocate to the external instance in charge of distributing the transaction data. Other differences are that GTM is extended for global timestamps and global sequence values, but that's not really related to ACID. This just reminded me of the following message, where I wondered about the possibility of adding hooks in those hot code paths... http://www.postgresql.org/message-id/cab7npqtdjf-58wuf-xz01nkj7wf0e+eukggqhd0igvsod4h...@mail.gmail.com -- Michael
Re: [HACKERS] pg_rewind and replication slots
On Fri, Nov 13, 2015 at 7:41 PM, Vladimir Borodin wrote:
> Should pg_rewind ignore pg_replslot dir at all? As it does pg_basebackup,
> for example.

To keep the code of pg_rewind simple, the decision has been made to make pg_rewind copy everything; the necessary post-cleanup or pre-copy saving (like that of the target node's postgresql.conf, which would get overwritten) is the responsibility of the upper application layer calling pg_rewind. -- Michael
Re: [HACKERS] Parallel Seq Scan
On Fri, Nov 13, 2015 at 10:56 AM, Robert Haas wrote:
>
> On Thu, Nov 12, 2015 at 10:39 PM, Amit Kapila wrote:
> > The number of shared buffers hit could be different across different runs
> > because the read sequence of parallel workers can't be guaranteed, also
> > I don't think same is even guaranteed for Seq Scan node,
>
> The number of hits could be different. However, it seems like any
> sequential scan, parallel or not, should have a number of accesses
> (hit + read) equal to the size of the relation. Not sure if that's
> what is happening here.

After the patch provided above to fix the issue reported by Pavel, that is the behaviour, but I think there are a few more things which we might want to consider. Just refer to the below plan:

Total pages in table:

postgres=# select relname,relpages from pg_class where relname like '%t2%';
 relname | relpages
---------+----------
 t2      |     5406
(1 row)

Parallel plan:

postgres=# explain (analyze,buffers,timing) select count(*) from t2 where c1 % 10 = 0;
                              QUERY PLAN
 Aggregate  (cost=8174.90..8174.91 rows=1 width=0) (actual time=1055.294..1055.294 rows=1 loops=1)
   Buffers: shared hit=446 read=5054
   ->  Gather  (cost=0.00..8162.40 rows=5000 width=0) (actual time=79.787..959.651 rows=10 loops=1)
         Number of Workers: 2
         Buffers: shared hit=446 read=5054
         ->  Parallel Seq Scan on t2  (cost=0.00..8162.40 rows=5000 width=0) (actual time=30.771..2518.844 rows=10 loops=1)
               Filter: ((c1 % 10) = 0)
               Rows Removed by Filter: 90
               Buffers: shared hit=352 read=5054
 Planning time: 0.170 ms
 Execution time: 1059.400 ms
(11 rows)

Lets focus on Buffers and actual time in the above plan:

Buffers - At the Parallel Seq Scan node, it shows a total of 5406 (352+5054) buffers, which tallies with what is expected.
However at the Gather node, it shows 5500 (446+5054), and the reason for that is that we accumulate overall buffer usage for the parallel execution of each worker, which includes the startup of the node as well (refer ParallelQueryMain()), and that gets counted towards the buffer calculation of the Gather node too. The theory behind collecting overall buffer usage for parallel execution was that we need it for pg_stat_statements, where the stats are accumulated for the overall execution, not on a node-by-node basis (refer queryDesc->totaltime usage in standard_ExecutorRun()). I think here we need to decide what is the right value to display at the Gather node:

1. Display the same number of buffers at the Gather node as at the Parallel Seq Scan node.
2. Display the number of buffers at the Parallel Seq Scan node plus the additional buffers used by parallel workers for miscellaneous work like ExecutorStart(), etc.
3. Don't account for buffers used by parallel workers.
4. Anything better?

Also in conjunction with the above, we need to see what should be accounted for pg_stat_statements.

actual_time -
Actual time at Gather node = 79.787..959.651
Actual time at Parallel Seq Scan node = 30.771..2518.844

Time at the Parallel Seq Scan node is more than the time at the Gather node because the time for the parallel workers is also accumulated for the Parallel Seq Scan node, whereas some of it doesn't get accounted for at the Gather node. Now this could be confusing for users, because the time displayed at the Parallel Seq Scan node will be equal to time_taken_by_worker-1 + time_taken_by_worker-2 + ... This time could be more than the actual time taken by the query, because each worker executes in parallel but we have accounted the time as if each one were executing serially. I think the time for fetching the tuples from workers is already accounted for at the Gather node, so maybe for the Parallel Seq Scan node we can omit adding the time for each of the parallel workers. Thoughts? With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
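The time confusion above comes down to summing versus taking the maximum over concurrent workers. If per-worker runtimes are simply added into the Parallel Seq Scan node, the node can report more time than the whole query took, because the workers overlapped; wall-clock time is bounded by the slowest worker, not the sum. A tiny illustration (the worker numbers are made up for the example):

```python
# Two workers running concurrently on the same Parallel Seq Scan.
worker_times_ms = [379.4, 762.0]

# What summing per-worker instrumentation shows at the scan node:
serialized_total = sum(worker_times_ms)

# What actually bounds the query's wall-clock time for this node:
wall_clock_bound = max(worker_times_ms)
```

Here serialized_total is well over a second even though no single worker ran that long, which mirrors the "379.407..1141.437 at the node, 403 ms total" oddity reported upthread.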
Re: [HACKERS] Parallel Seq Scan
On Wed, Nov 11, 2015 at 11:40 PM, Pavel Stehule wrote:
>
> yes - the another little bit unclean in EXPLAIN is number of workers. If I
> understand to the behave, the query is processed by two processes if
> workers in the explain is one.
>

You are right, and I think that is the current working model of the Gather node, which seems okay. I think the more serious thing here is that there is a possibility that Explain Analyze can show the number of workers as more than the actual workers working for the Gather node. We have already discussed that Explain Analyze should show the actual number of workers used in query execution; the patch for the same is still pending. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] eXtensible Transaction Manager API
On Tue, Nov 10, 2015 at 3:46 AM, Robert Haas wrote:
> On Sun, Nov 8, 2015 at 6:35 PM, Michael Paquier wrote:
>> Sure. Now imagine that the pg_twophase entry is corrupted for this
>> transaction on one node. This would trigger a PANIC on it, and
>> transaction would not be committed everywhere.
>
> If the database is corrupted, there's no way to guarantee that
> anything works as planned. This is like criticizing
> somebody's disaster recovery plan on the basis that it will be
> inadequate if the entire planet earth is destroyed.

As well as there could be FS, OS, network problems... To come back to the point, my point is simply that I found surprising the sentence of Konstantin upthread saying that if commit fails on some of the nodes we should roll back the prepared transaction on all nodes. In the example given, in the phase after calling dtm_end_prepare, say we perform COMMIT PREPARED correctly on node 1, but then fail it on node 2 because a meteor has hit a server; it seems that we cannot roll back. Instead we had better roll in a backup and be sure that the transaction gets committed. How would you roll back the transaction already committed on node 1? But perhaps I missed something...

>> I am aware of the fact
>> that by definition PREPARE TRANSACTION ensures that a transaction will
>> be committed with COMMIT PREPARED, just trying to see any corner cases
>> with the approach proposed. The DTM approach is actually rather close
>> to what a GTM in Postgres-XC does :)
>
> Yes. I think that we should try to learn as much as possible from the
> XC experience, but that doesn't mean we should incorporate XC's fuzzy
> thinking about 2PC into PG. We should not.

Yep.

> One point I'd like to mention is that it's absolutely critical to
> design this in a way that minimizes network roundtrips without
> compromising correctness. XC's GTM proxy suggests that they failed to
> do that.
> I think we really need to look at what's going to be on the
> other sides of the proposed APIs and think about whether it's going to
> be possible to have a strong local caching layer that keeps network
> roundtrips to a minimum. We should consider whether the need for such
> a caching layer has any impact on what the APIs should look like.

At this time, the number of round trips needed, particularly for READ COMMITTED transactions that need a new snapshot for each query, was really a performance killer. We used DBT-1 (TPC-W), which is less OLTP-like than DBT-2 (TPC-C); still, with DBT-1 the scalability limit was quickly reached with 10-20 nodes...

> For example, consider a 10-node cluster where each node has 32 cores
> and 32 clients, and each client is running lots of short-running SQL
> statements. The demand for snapshots will be intense. If every
> backend separately requests a snapshot for every SQL statement from
> the coordinator, that's probably going to be terrible. We can make it
> the problem of the stuff behind the DTM API to figure out a way to
> avoid that, but maybe that's going to result in every DTM needing to
> solve the same problems.

This recalls a couple of things, though in 2009 I was not playing with servers of this scale.

> Another thing that I think we need to consider is fault-tolerance.
> For example, suppose that transaction commit and snapshot coordination
> services are being provided by a central server which keeps track of
> the global commit ordering. When that server gets hit by a freak bolt
> of lightning and melted into a heap of slag, somebody else needs to
> take over. Again, this would live below the API proposed here, but I
> think it really deserves some thought before we get too far down the
> path.
> XC didn't start thinking about how to add fault-tolerance until
> quite late in the project, I think, and the same could be said of
> PostgreSQL itself: some newer systems have easier-to-use fault
> tolerance mechanisms because it was designed in from the beginning.
> Distributed systems by nature need high availability to a far greater
> degree than single systems, because when there are more nodes, node
> failures are more frequent.

Your memories on the matter are right. In the case of XC, the SPOF that is the GTM has been made somewhat more stable with the creation of a GTM standby, though it did not solve the scalability limit of having a centralized snapshot facility. It actually increased the load on the whole system for short transactions. Alea jacta est. -- Michael
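The two-phase-commit hazard discussed upthread can be shown in miniature: once COMMIT PREPARED has succeeded on any node, the coordinator can no longer roll the global transaction back; the only correct recovery is to keep retrying the remaining commits until they succeed. A toy Python driver (all names hypothetical, failures simulated rather than real):

```python
def commit_prepared(nodes, fail_on):
    """Try COMMIT PREPARED on every node; `fail_on` simulates nodes
    that crash mid-commit. Returns what committed, what is still
    pending, and whether a global rollback is still an option."""
    committed, pending = [], []
    for node in nodes:
        if node in fail_on:
            pending.append(node)  # e.g. the meteor hit this server
        else:
            committed.append(node)
    # If anything committed, rollback is off the table: the commit is
    # durable there, so recovery must retry COMMIT PREPARED on `pending`.
    can_rollback = not committed
    return committed, pending, can_rollback
```

This is exactly why "roll back the prepared transaction on all nodes if commit fails somewhere" cannot be the recovery rule after the commit phase has begun.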
Re: [HACKERS] Parallel Seq Scan
On Fri, Nov 13, 2015 at 6:17 PM, Thom Brown wrote:
>
> > The number of shared buffers hit could be different across different runs
> > because the read sequence of parallel workers can't be guaranteed, also
> > I don't think same is even guaranteed for Seq Scan node, the other
> > operations
> > in parallel could lead to different number, however the actual problem was
> > that in one of the plans shown by you [1], the Buffers hit at Gather node
> > (175288) is lesser than the Buffers hit at Parallel Seq Scan node (241201).
> > Do you still (after applying above patch) see that Gather node is showing
> > lesser hit buffers than Parallel Seq Scan node?
>
> Hmm... that's odd, I'm not seeing the problem now, so maybe I'm mistaken there.

Thanks for confirming the same.

> > Exactly what time you are referring here, Execution Time or actual time
> > shown on Parallel Seq Scan node and what problem do you see with
> > the reported time?
>
> I'm referring to the Parallel Seq Scan actual time, showing
> "379.407..1141.437" with 1 worker, but the total execution time shows
> 403.045. If one worker is taking over a second, how come the whole
> query was less than half a second?

Yeah, this could be possible due to the way time is currently accumulated; see my mail which I sent just before this one. I think we might need to do something, else it could be confusing for users. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com