Re: [HACKERS] Inaccurate results from numeric ln(), log(), exp() and pow()

2015-11-13 Thread Dean Rasheed
On 12 November 2015 at 21:01, Tom Lane  wrote:
> Dean Rasheed  writes:
>> These results are based on the attached, updated patch which includes
>> a few minor improvements.
>
> I started to look at this patch, and was immediately bemused by the
> comment in estimate_ln_weight:
>
> /*
>  * 0.9 <= var <= 1.1
>  *
>  * Its logarithm has a negative weight (possibly very large). Estimate
>  * it using ln(var) = ln(1+x) = x + O(x^2) ~= x.
>  */
>
> This comment's nonsense of course: ln(0.9) is about -0.105, and ln(1.1) is
> about 0.095, which is not even negative much less "very large".

That's nonsense. The comment is perfectly correct. It's not saying the
logarithm is negative, it's saying that the *weight* of the logarithm
is negative. For the examples you cite, the weight is small and
negative (around -1), but for inputs closer to 1 it can be large and
negative.


>  We could
> just replace that with "For values reasonably close to 1, we can estimate
> the result using ln(var) = ln(1+x) ~= x."  I am wondering what's the point
> though: why not flush this entire branch in favor of always using the
> generic case?  I rather doubt that on any modern machine two uses of
> cmp_var plus init_var/sub_var/free_var are going to be cheaper than the
> log() call you save.  You would need a different way to special-case 1.0,
> but that's not expensive.
>

No, you're missing the point entirely. Suppose var is very close to 1,
say var = 1 + 1e-21, in which case ln(var) is approximately 1e-21.
The code in the first branch uses the approximation ln(var) = ln(1+x)
~= x = var-1 to see that the weight of ln(var) is around -21. You
can't do that with the code in second branch just by looking at the
first couple of digits of var and doing a floating point
approximation, because all you'd see would be 1.000 and your
approximation to the logarithm would be orders of magnitude out (as is
the case with the code that this function is replacing, which comes
with no explanatory comments at all).
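
To make that concrete with the numbers above:

    var     = 1 + 1e-21
    x       = var - 1 = 1e-21
    ln(var) = x - x^2/2 + x^3/3 - ... ~= 1e-21

so the weight of ln(var) is around -21, and the first branch reads that
straight off the numeric digits of x = var-1, with no lossy floating-point
arithmetic involved.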


> Larger questions are
>
> (1) why the Abs() in the specification of estimate_ln_weight --- that
> doesn't comport with the text about "Estimate the weight of the most
> significant digit".

Yes it does, and the formula is correct. The function returns an
estimate for the weight of the logarithm of var. Let L = ln(var), then
what we are trying to return is the weight of L. So we don't care
about the sign of L, we just need to know it's weight -- i.e., we want
approximately log10(Abs(L)) which is log10(Abs(ln(var))) as the
comment says. The Abs() is necessary because L might be negative (when
0 < var < 1).
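
For a value further from 1 the same formula behaves as expected, e.g.:

    var = 0.001:  L = ln(var) ~= -6.9078
                  log10(Abs(L)) ~= 0.84  =>  weight of L is 0

i.e. the sign of L is irrelevant; only the position of its leading digit
matters.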

>  (The branch I'm proposing you remove fails to
> honor that anyway.)
>

No it doesn't. It honours the Abs() by ignoring the sign of x, and
just looking at its weight.


> (2) what should the behavior be for input == 1, and why?  The code
> is returning zero, but that seems inconsistent or at least
> underdocumented.
>

ln(1) is 0, which has a weight of 0. I suppose you could argue that
technically 0 could have any weight, but looking at the bigger
picture, what this function is for is deciding how many digits of
precision are needed in intermediate calculations to retain full
precision in the final result. When the input is exactly 1, the
callers will have nice exact results, and no extra intermediate
precision is needed, so returning 0 is quite sensible.

I guess adding words to that effect to the comment makes sense, since
it clearly wasn't obvious.


> (3) if it's throwing an error for zero input, why not for negative
> input?  I'm not sure that either behavior is within the purview of
> this function, anyway.  Seems like returning zero might be fine.
>
> regards, tom lane

[Shrug] It doesn't make much difference since both those error cases
will be caught a little later on in the callers, but since this
function needs to test for an empty digit array in the second branch
anyway, it seemed like it might as well report that error there.
Returning zero would be fine too.

Regards,
Dean




Re: [HACKERS] Parallel Seq Scan

2015-11-13 Thread Thom Brown
On 13 November 2015 at 15:22, Amit Kapila  wrote:
> On Fri, Nov 13, 2015 at 7:59 PM, Thom Brown  wrote:
>>
>> On 13 November 2015 at 13:38, Amit Kapila  wrote:
>> > On Wed, Nov 11, 2015 at 11:40 PM, Pavel Stehule
>> > 
>> > wrote:
>> >>
>> >>
>> >> yes - another slightly unclean thing in EXPLAIN is the number of
>> >> workers.  If I understand the behaviour, the query is processed by two
>> >> processes even if the number of workers shown in the explain is one.
>> >>
>> >
>> > You are right and I think that is current working model of Gather
>> > node which seems okay. I think the more serious thing here
>> > is that there is possibility that Explain Analyze can show the
>> > number of workers as more than actual workers working for Gather
>> > node.  We have already discussed that Explain Analyze should show
>> > the actual number of workers used in query execution, patch for
>> > the same is still pending.
>>
>> This may have already been discussed before, but in a verbose output,
>> would it be possible to see the nodes for each worker?
>>
>
> There will be hardly any difference in the nodes for each worker, and it
> could make for a very long plan with a large number of workers.  What kind
> of additional information do you want that can't be shown in the current
> format?

For explain plans, not that useful, but it's useful to see how long
each worker took for explain analyse.  And I imagine as more
functionality is added to scan partitions and foreign scans, it will
perhaps be more useful when the plans won't be identical. (or would
they?)

>>
>> And perhaps associated PIDs?
>>
>
> Yeah, that can be useful, if others also feel like it is important, I can
> look into preparing a patch for the same.

Thanks.

Thom




Re: [HACKERS] [PATCH] Refactoring of LWLock tranches

2015-11-13 Thread Ildus Kurbangaliev
On Thu, 12 Nov 2015 14:59:59 -0500
Robert Haas  wrote:

> On Wed, Nov 11, 2015 at 6:50 AM, Ildus Kurbangaliev
>  wrote:
> > Attached a new version of the patch that moves SLRU tranches and LWLocks to
> > SLRU control structs.
> >
> > `buffer_locks` field now contains LWLocks itself, so we have some economy of
> > the memory here. `pstrdup` removed in SimpleLruInit. I didn't
> > change names from the previous patch yet, but I don't mind if they'll
> > be changed.
> 
> I've committed this with some modifications.  I did a little bit of
> renaming, and I stored a copy of the SLRU name in shared memory.
> Otherwise, we're relying on the fact that a static string will be at
> the same backend-private address in every process, which is a
> dangerous assumption at best.

Thanks! That's very inspiring.
I have some questions about next steps on other tranches.
First of all, I think we can have two API calls, something like: 

1) LWLockRequestTranche(char *tranche_name, int locks_count)
2) LWLockGetTranche(char *tranche_name)

LWLockRequestTranche reserves an item in the main tranches array in shared
memory, and allocates space for the name, the LWLockTranche and the LWLocks.
Here is the first question: is it OK for this call to be made from the
*ShmemSize functions? We would keep those requests, calculate the required
size in LWLockShmemSize (by moving this call to the end), and create all
these tranches in CreateLWLocks.

And the second issue. We have tranches created manually by backends (xlog,
and now SLRU), which are located in their own shared memory. For those I
suggest just reserving an item in the tranches array with
LWLockRequestTranche(NULL, 0).

As for tranches that are created after Postgres initialization, I suggest
not keeping them in shared memory, and just using a local array as in the
current implementation. Their names will not be readable; we can determine
by id that they are not in shared memory, and
show "other" in pg_stat_activity and other places.

-
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




Re: [HACKERS] Parallel Seq Scan

2015-11-13 Thread Amit Kapila
On Fri, Nov 13, 2015 at 7:59 PM, Thom Brown  wrote:
>
> On 13 November 2015 at 13:38, Amit Kapila  wrote:
> > On Wed, Nov 11, 2015 at 11:40 PM, Pavel Stehule 
> > wrote:
> >>
> >>
> >> yes - another slightly unclean thing in EXPLAIN is the number of
> >> workers.  If I understand the behaviour, the query is processed by two
> >> processes even if the number of workers shown in the explain is one.
> >>
> >
> > You are right and I think that is current working model of Gather
> > node which seems okay. I think the more serious thing here
> > is that there is possibility that Explain Analyze can show the
> > number of workers as more than actual workers working for Gather
> > node.  We have already discussed that Explain Analyze should show
> > the actual number of workers used in query execution, patch for
> > the same is still pending.
>
> This may have already been discussed before, but in a verbose output,
> would it be possible to see the nodes for each worker?
>

There will be hardly any difference in the nodes for each worker, and it
could make for a very long plan with a large number of workers.  What kind
of additional information do you want that can't be shown in the current
format?

>
> And perhaps associated PIDs?
>

Yeah, that can be useful, if others also feel like it is important, I can
look into preparing a patch for the same.



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Parallel Seq Scan

2015-11-13 Thread Jeff Janes
On Wed, Nov 11, 2015 at 6:53 AM, Robert Haas  wrote:
>
> I've committed most of this, except for some planner bits that I
> didn't like, and after a bunch of cleanup.  Instead, I committed the
> consider-parallel-v2.patch with some additional planner bits to make
> up for the ones I removed from your patch.  So, now we have parallel
> sequential scan!

Pretty cool.  All I had to do is mark my slow plperl functions as
being parallel safe, and bang, parallel execution of them for seq
scans.

But, there does seem to be a memory leak.

The setup (warning: 20GB of data):

create table foobar as select md5(floor(random()*150)::text) as
id, random() as volume
  from generate_series(1,2);

set max_parallel_degree TO 8;

explain select count(*) from foobar where volume >0.9;
                                  QUERY PLAN
------------------------------------------------------------------------
 Aggregate  (cost=2626202.44..2626202.45 rows=1 width=0)
   ->  Gather  (cost=1000.00..2576381.76 rows=19928272 width=0)
         Number of Workers: 7
         ->  Parallel Seq Scan on foobar  (cost=0.00..582554.56 rows=19928272 width=0)
               Filter: (volume > '0.9'::double precision)


Now running this query leads to an OOM condition:

explain (analyze, buffers) select count(*) from foobar where volume >0.9;
WARNING:  terminating connection because of crash of another server process


Running it without the explain also causes the problem.

The memory dump at some point before the crash looks like:

TopMemoryContext: 62496 total in 9 blocks; 16976 free (60 chunks); 45520 used
  TopTransactionContext: 8192 total in 1 blocks; 4024 free (8 chunks); 4168 used
ExecutorState: 1795153920 total in 223 blocks; 4159872 free (880
chunks); 1790994048 used
  ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
  Operator class cache: 8192 total in 1 blocks; 1680 free (0 chunks); 6512 used
  other insignificant stuff...

I don't have enough RAM for each of 7 workers to use all that much more than 2GB

work_mem is 25MB, maintenance work_mem is 64MB

Cheers,

Jeff




[HACKERS] Foreign Data Wrapper

2015-11-13 Thread Big Mike
Writing a Foreign Data Wrapper and interested in isolating the WHERE clause
to speed up the access of an indexed file on my filesystem. I'm attempting
to understand the inner workings of how the data is retrieved so I'm
writing code to just handle one case at the moment: WHERE clause on a
single column in the foreign 'table'.

SELECT * FROM t WHERE testval = 1

I have this code so far, an implementation of the IterateForeignScan
interface.

static TupleTableSlot *
bIterateForeignScan(ForeignScanState *node) {
...
RestrictInfo *rinfo = (RestrictInfo *)node->ss.ps.qual;
...
}

yet I am not familiar with what I need to do to pick apart RestrictInfo in
order to gather 'testval', '=', and '1' separately so I can interpret and
pass those through to my file parser.
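
For what it's worth, here is roughly the shape I imagined that pick-apart
step taking. This is only a sketch of my guess: it assumes the pushed-down
quals show up as simple "Var op Const" OpExpr nodes in the scan plan's qual
list, and it uses the two-argument get_attname(); please tell me if this is
the wrong level to be working at.

#include "postgres.h"

#include "nodes/execnodes.h"
#include "nodes/primnodes.h"
#include "utils/lsyscache.h"
#include "utils/rel.h"

/* Sketch: pull column name, operator name and constant out of quals of
 * the form "column op constant"; anything else is simply skipped here. */
static void
extract_simple_quals(ForeignScanState *node)
{
    List       *quals = node->ss.ps.plan->qual;
    ListCell   *lc;

    foreach(lc, quals)
    {
        Node       *clause = (Node *) lfirst(lc);
        OpExpr     *op;
        Node       *left;
        Node       *right;

        if (!IsA(clause, OpExpr))
            continue;
        op = (OpExpr *) clause;
        if (list_length(op->args) != 2)
            continue;

        left = (Node *) linitial(op->args);
        right = (Node *) lsecond(op->args);

        if (IsA(left, Var) && IsA(right, Const))
        {
            Var        *var = (Var *) left;
            Const      *cst = (Const *) right;
            char       *opname = get_opname(op->opno);      /* e.g. "=" */
            char       *colname =
                get_attname(RelationGetRelid(node->ss.ss_currentRelation),
                            var->varattno);

            /* hand colname, opname and cst->constvalue (respecting
             * cst->constisnull) over to the file parser here */
        }
    }
}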

Am I going about this the correct way or is there another path I should
follow?


Re: [HACKERS] Freeze avoidance of very large table.

2015-11-13 Thread Andres Freund
On 2015-10-31 11:02:12 +0530, Amit Kapila wrote:
> On Thu, Oct 8, 2015 at 11:05 PM, Simon Riggs  wrote:
> >
> > On 1 October 2015 at 23:30, Josh Berkus  wrote:
> >>
> >> On 10/01/2015 07:43 AM, Robert Haas wrote:
> >> > On Thu, Oct 1, 2015 at 9:44 AM, Fujii Masao 
> wrote:
> >> >> I wonder how much it's worth renaming only the file extension while
> >> >> there are many places where "visibility map" and "vm" are used,
> >> >> for example, log messages, function names, variables, etc.
> >> >
> >> > I'd be inclined to keep calling it the visibility map (vm) even if it
> >> > also contains freeze information.
> >> >
> 
> What is your main worry about changing the name of this map, is it
> about more code churn or is it about that we might introduce new issues
> or is it about that people are already accustomed to call this map as
> visibility map?

Several:
* Visibility map is rather descriptive, none of the replacement terms
  imo come close. Few people will know what a 'freeze' map is.
* It increases the size of the patch considerably
* It forces tooling that knows about the layout of the database
  directory to change their tools

On the benefit side, the only argument I've heard so far is that it allows
one to disambiguate the format. But, uh, a look at the major version does
that just as well, for far less trouble.

> It seems to me quite logical for understanding purpose as well.  Any new
> person who wants to work in this area or is looking into it will always
> wonder why this map is named as visibility map even though it contains
> information about visibility of page as well as frozen state of page.

Being frozen is about visibility as well.

Greetings,

Andres Freund




Re: [HACKERS] Foreign Data Wrapper

2015-11-13 Thread Corey Huinker
On Fri, Nov 13, 2015 at 1:46 PM, Big Mike  wrote:

> Writing a Foreign Data Wrapper and interested in isolating the WHERE
> clause to speed up the access of an indexed file on my filesystem. I'm
> attempting to understand the inner workings of how the data is retrieved so
> I'm writing code to just handle one case at the moment: WHERE clause on a
> single column in the foreign 'table'.
>
> SELECT * FROM t WHERE testval = 1
>
> I have this code so far, an implementation of the IterateForeignScan
> interface.
>
> static TupleTableSlot *
> bIterateForeignScan(ForeignScanState *node) {
> ...
> RestrictInfo *rinfo = (RestrictInfo *)node->ss.ps.qual;
> ...
> }
>
> yet I am not familiar with what I need to do to pick apart RestrictInfo in
> order to gather 'testval', '=', and '1' separately so I can interpret and
> pass those through to my file parser.
>
> Am I going about this the correct way or is there another path I should
> follow?
>


I would look at http://multicorn.org/ which gives you a working python
framework. You subclass their ForeignDataWrapper class, override the
__init__() and execute() functions, and that's about it. The execute()
function has a list called quals that would set you up for the filtering
you want to do.

I would get the foreign data wrapper fully debugged this way before
attempting to refactor in C. And when you do finally re-code, you can see
what multicorn itself did to implement your filters.


Re: [HACKERS] Inaccurate results from numeric ln(), log(), exp() and pow()

2015-11-13 Thread Tom Lane
Dean Rasheed  writes:
> On 12 November 2015 at 21:01, Tom Lane  wrote:
>> I started to look at this patch, and was immediately bemused by the
>> comment in estimate_ln_weight:

> That's nonsense. The comment is perfectly correct. It's not saying the
> logarithm is negative, it's saying that the *weight* of the logarithm
> is negative.

Ah, you're right --- I'd gotten confused about the distinction between
ln(x) and ln(ln(x)).  Nevermind ...

Next question: in the additional range-reduction step you added to ln_var,
why stop there, ie, what's the rationale for this magic number:

if (Abs((x.weight + 1) * DEC_DIGITS) > 10)

Seems like we arguably should do this whenever the weight isn't zero,
so as to minimize the number of sqrt() steps.  (Yes, I see the point
about not getting into infinite recursion, but that only says that
the threshold needs to be more than 10, not that it needs to be 10^10.)

Also, it seems a little odd to put the recursive calculation of ln(10)
where you did, rather than down where it's used, ie why not

mul_var(result, , result, local_rscale);

ln_var(_ten, _10, local_rscale);
int64_to_numericvar((int64) pow_10, );
mul_var(_10, , , local_rscale);
add_var(result, , result);

round_var(result, rscale);

As you have it, ln_10 will be calculated with possibly a smaller rscale
than is used in this stanza.  That might be all right but it seems dubious
--- couldn't the lower-precision result leak into digits we care about?

regards, tom lane




Re: [HACKERS] Freeze avoidance of very large table.

2015-11-13 Thread Bruce Momjian
On Tue, Nov  3, 2015 at 09:03:49AM +0530, Amit Kapila wrote:
> I think in that case we can call it as page info map or page state map, but
> I find retaining visibility map name in this case or for future (if we want to
> add another bit) as confusing.  In-fact if you find "visibility and freeze
> map",
> as excessively long, then we can change it to "page info map" or "page state
> map" now as well.

Coming in late here, but the problem with "page info map" is that free
space is also page info (how much free space on each page), so "page
info map" isn't very descriptive.  "page status" or "page state" might
make more sense, but even then, free space is a kind of page
status/state.  What is happening is that broadening the name to cover
both visibility and freeze state also encompasses free space.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Roman grave inscription +




Re: [HACKERS] Inaccurate results from numeric ln(), log(), exp() and pow()

2015-11-13 Thread Dean Rasheed
On 13 November 2015 at 18:36, Tom Lane  wrote:
> Next question: in the additional range-reduction step you added to ln_var,
> why stop there, ie, what's the rationale for this magic number:
>
> if (Abs((x.weight + 1) * DEC_DIGITS) > 10)
>
> Seems like we arguably should do this whenever the weight isn't zero,
> so as to minimize the number of sqrt() steps.  (Yes, I see the point
> about not getting into infinite recursion, but that only says that
> the threshold needs to be more than 10, not that it needs to be 10^10.)
>

It's a bit arbitrary. There is a tradeoff here -- computing ln(10) is
more expensive than doing a sqrt() since the Babylonian algorithm used
for sqrt() is quadratically convergent, whereas the Taylor series for
ln() converges more slowly. At the default precision, ln(10) is around
7 times slower than sqrt() on my machine, although that will vary with
precision, and the sqrt()s increase the local rscale and so they will
get slower. Anyway, it seemed reasonable to not do the extra ln()
unless it was going to save at least a couple of sqrt()s.
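
For reference, the two reduction identities being traded off are

    ln(x) = 2 * ln(sqrt(x))                  (one sqrt() roughly halves the
                                              decimal weight of x)
    ln(x) = ln(x * 10^(-k)) + k * ln(10)     (one ln(10) computation removes
                                              a weight of k entirely)

so a single ln(10) is only worth paying for once it replaces at least a
couple of sqrt() steps, which is what the cutoff above is aiming at.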


> Also, it seems a little odd to put the recursive calculation of ln(10)
> where you did, rather than down where it's used, ie why not
>
> mul_var(result, , result, local_rscale);
>
> ln_var(_ten, _10, local_rscale);
> int64_to_numericvar((int64) pow_10, );
> mul_var(_10, , , local_rscale);
> add_var(result, , result);
>
> round_var(result, rscale);
>
> As you have it, ln_10 will be calculated with possibly a smaller rscale
> than is used in this stanza.  That might be all right but it seems dubious
> --- couldn't the lower-precision result leak into digits we care about?
>

Well it still has 8 digits more than the final rscale, so it's
unlikely to cause any loss of precision when added to the result and
then rounded to rscale. But on the other hand it does look neater to
compute it there, right where it's needed.

Regards,
Dean




Re: [HACKERS] Inaccurate results from numeric ln(), log(), exp() and pow()

2015-11-13 Thread Tom Lane
Dean Rasheed  writes:
> On 13 November 2015 at 18:36, Tom Lane  wrote:
>> Seems like we arguably should do this whenever the weight isn't zero,
>> so as to minimize the number of sqrt() steps.

> It's a bit arbitrary. There is a tradeoff here -- computing ln(10) is
> more expensive than doing a sqrt() since the Babylonian algorithm used
> for sqrt() is quadratically convergent, whereas the Taylor series for
> ln() converges more slowly. At the default precision, ln(10) is around
> 7 times slower than sqrt() on my machine, although that will vary with
> precision, and the sqrt()s increase the local rscale and so they will
> get slower. Anyway, it seemed reasonable to not do the extra ln()
> unless it was going to save at least a couple of sqrt()s.

OK --- I think I miscounted how many sqrt's we could expect to save.

One more thing: the approach you used in power_var() of doing a whole
separate exp * ln(base) calculation to approximate the result weight
seems mighty expensive, even if it is done at minimal precision.
Couldn't we get a good-enough approximation using basically
numericvar_to_double_no_overflow(exp) * estimate_ln_weight(base) ?

regards, tom lane




Re: [HACKERS] proposal: multiple psql option -c

2015-11-13 Thread David G. Johnston
On Fri, Nov 13, 2015 at 1:54 PM, Catalin Iacob 
wrote:

> So I promised I'd try to document this. I had a look at the proposed
> semantics of -C and I think in the patch they're too complicated which
> makes explaining them hard.
>
> My assumptions about behaviour without this patch, from reading the
> docs and some experimenting, correct me if I'm wrong:
>
> 1. psql normally splits its input by ; let's call each piece of the
> split a statement
>
> 2. for every statement resulting after 1, if it's a \ command it's
> interpreted internally, else a query with it is sent to the server,
> the result is displayed
>
> 3. 1. and 2. happen when the input comes from a file (-f) or from stdin
>
> 4. autocommit off changes behaviour in that it sends a BEGIN before
> any of the statements after the split in 1 (except for \ commands,
> BEGIN or things like VACUUM which don't work within transactions)
>
> 5. --single-transaction changes behaviour in that it puts a BEGIN
> before the whole input (not around each statement) and a COMMIT after
>
> 6. all of the above DON'T apply for -c, which does very different things: it
> doesn't split and instead it sends everything, in one query to the
> backend. The backend can execute such a thing (it splits itself by ;)
> except in some cases like SELECT + VACUUM. Since the single query is
> effectively a single transaction for the backend -c ignores
> --single-transaction and autocommit off. Even more, when executing
> such a multiple statement the backend only returns results for the
> last statement of the query.
>
> From the above it seems -c is a different thing altogether while other
> behaviour allows 1 input with multiple commands, multiple results and
> works the same on stdin and a file.
>
> So my proposal is: allow a *single* argument for -C and treat its
> content *exactly* like the input from stdin or from a file.
>
> This answers all the questions about interactions with
> --single-transaction and autocommit naturally: it behaves exactly like
> stdin and -f behave today. And having a single parameter is similar to
> having a single file or single stdin. Having multiple -C is also
> confusing since it seems the statements in one -C are grouped somehow
> and the ones in the next -C are another group so this starts feeling
> like there's maybe a transaction per -C group etc.
>
> Am I missing something or is it that simple?
>

​While not in patch form here is some food for thought.

Tweaks to -c to link it with -C

​6c6
<   Specifies that psql is to execute one
---
>   Specifies that psql is to execute the
12d11
<   
32a32,36
>   Furthermore, only a single instance of this parameter is accepted.
>   Attempting to provide multiple instances will result in the entire
>   shell command failing.
>   
>   
34,35c38,41
the -c string often has unexpected results.  Two
>better options are available to execute multiple commands in a
>controlled manner.  You may use the -C option, described next, or
>choose to feed multiple commands to
psql's

​Draft -C thoughts

​  -C command(s)
  --multi-command=command(s)
  
  
  Specifies that psql is to execute one or
  more command strings, commands,
  and then exit.  This differs from -c in that multiple instances may
be present
  on the same shell command.
  
  
  Also unlike -c, individual -C commands and statements are executed
  in auto-commit mode.  The following pseudo-code example describes the
  script that is effectively created.
  

psql -C 'SELECT 1;SELECT 2' -C 'SELECT 3;SELECT 4'
psql <<EOF
BEGIN;
SELECT 1;
COMMIT;
BEGIN;
SELECT 2;
COMMIT;
BEGIN;
SELECT 3;
COMMIT;
BEGIN;
SELECT 4;
COMMIT;
EOF

  
  Alternatively the option --single-transaction makes
the entire multi-command execute
  within a single transaction.  There is no option to have entire
-C commands commit
  independently of each other; you have to issue separate psql shell
commands.
  
  
  Output from the -C command behaves more script-like
than -c as each
  statement within each command is output.
  
  
  As with -c the Start-up files
(psqlrc and ~/.psqlrc)
  are ignored if this option is present on the command-line.
  
  
  One particular motivation for introducing -C is that the first command
  below fails if executed using -c, but now there are two equivalent
  command lines that work.

psql -Atq -C "VACUUM FULL foo; SELECT pg_relation_size('foo')"
psql -Atq -C "VACUUM FULL foo" -C "SELECT pg_relation_size('foo')"

  


Re: [HACKERS] Inaccurate results from numeric ln(), log(), exp() and pow()

2015-11-13 Thread Dean Rasheed
On 13 November 2015 at 21:00, Tom Lane  wrote:
> BTW, something I find confusing and error-prone is that this patch keeps
> on using the term "weight" to refer to numbers expressed in decimal digits
> (ie, the approximate log10 of something).  Basically everywhere in the
> existing code, "weights" are measured in base-NBASE digits, while "scales"
> are measured in decimal digits.  I've not yet come across anyplace where
> you got the units wrong, but it seems like a gotcha waiting to bite the
> next hacker.
>
> I thought for a bit about s/weight/scale/g in the patch, but that is not
> le mot juste either, since weight is generally counting digits to the left
> of the decimal point while scale is generally counting digits to the
> right.
>
> The best idea that has come to me is to use "dweight" to refer to a weight
> measured in decimal digits.  Anyone have a better thought?
>

Yeah, dweight works for me.

Regards,
Dean




Re: [HACKERS] check for interrupts in set_rtable_names

2015-11-13 Thread Tom Lane
Jeff Janes  writes:
> Someone sent my server a deranged query, it tripped my
> auto_explain.log_min_duration setting, that hit some kind of
> pathological case while assigning aliases, and now it sits
> uninterruptibly in set_rtable_names for hours.

> Is there any reason we can't check for interrupts in set_rtable_names,
> like the attached?

There's probably no reason not to do that, but I'd be much more interested
in eliminating the slowness to begin with ...

regards, tom lane




Re: [HACKERS] Inaccurate results from numeric ln(), log(), exp() and pow()

2015-11-13 Thread Tom Lane
BTW, something I find confusing and error-prone is that this patch keeps
on using the term "weight" to refer to numbers expressed in decimal digits
(ie, the approximate log10 of something).  Basically everywhere in the
existing code, "weights" are measured in base-NBASE digits, while "scales"
are measured in decimal digits.  I've not yet come across anyplace where
you got the units wrong, but it seems like a gotcha waiting to bite the
next hacker.

I thought for a bit about s/weight/scale/g in the patch, but that is not
le mot juste either, since weight is generally counting digits to the left
of the decimal point while scale is generally counting digits to the
right.

The best idea that has come to me is to use "dweight" to refer to a weight
measured in decimal digits.  Anyone have a better thought?

regards, tom lane




Re: [HACKERS] proposal: multiple psql option -c

2015-11-13 Thread Catalin Iacob
So I promised I'd try to document this. I had a look at the proposed
semantics of -C and I think in the patch they're too complicated which
makes explaining them hard.

My assumptions about behaviour without this patch, from reading the
docs and some experimenting, correct me if I'm wrong:

1. psql normally splits its input by ; let's call each piece of the
split a statement

2. for every statement resulting after 1, if it's a \ command it's
interpreted internally, else a query with it is sent to the server,
the result is displayed

3. 1. and 2. happen when the input comes from a file (-f) or from stdin

4. autocommit off changes behaviour in that it sends a BEGIN before
any of the statements after the split in 1 (except for \ commands,
BEGIN or things like VACUUM which don't work within transactions)

5. --single-transaction changes behaviour in that it puts a BEGIN
before the whole input (not around each statement) and a COMMIT after

6. all of the above DON'T apply for -c, which does very different things: it
doesn't split and instead it sends everything, in one query to the
backend. The backend can execute such a thing (it splits itself by ;)
except in some cases like SELECT + VACUUM. Since the single query is
effectively a single transaction for the backend -c ignores
--single-transaction and autocommit off. Even more, when executing
such a multiple statement the backend only returns results for the
last statement of the query.

From the above it seems -c is a different thing altogether while other
behaviour allows 1 input with multiple commands, multiple results and
works the same on stdin and a file.

So my proposal is: allow a *single* argument for -C and treat its
content *exactly* like the input from stdin or from a file.

This answers all the questions about interactions with
--single-transaction and autocommit naturally: it behaves exactly like
stdin and -f behave today. And having a single parameter is similar to
having a single file or single stdin. Having multiple -C is also
confusing since it seems the statements in one -C are grouped somehow
and the ones in the next -C are another group so this starts feeling
like there's maybe a transaction per -C group etc.

Am I missing something or is it that simple?




Re: [HACKERS] [DESIGN] ParallelAppend

2015-11-13 Thread Robert Haas
On Thu, Nov 12, 2015 at 12:09 AM, Kouhei Kaigai  wrote:
> I'm now designing the parallel feature of Append...
>
> Here is one challenge. How do we determine whether each sub-plan
> allows execution in the background worker context?

I've been thinking about these questions for a bit now, and I think we
can work on improving Append in multiple phases.  The attached patch
shows what I have in mind for phase 1.  Currently, if you set up an
inheritance hierarchy, you might get an Append some of whose children
are Gather nodes with Parallel Seq Scans under them, like this:

Append
-> Gather
  -> Parallel Seq Scan
-> Gather
  -> Parallel Seq Scan
-> Seq Scan

This is a crappy plan.  Each Gather node will try to launch its own
bunch of workers, which sucks.  The attached patch lets you get this
plan instead:

Gather
-> Append
  -> Partial Seq Scan
  -> Partial Seq Scan
  -> Partial Seq Scan

That's a much better plan.

To make this work, this plan introduces a couple of new concepts.
Each RelOptInfo gets a new partial_pathlist field, which stores paths
that need to be gathered to produce a workable plan.  Once we've
populated the partial_pathlist with whatever partial paths we know how
to generate, we can either push a Gather node on top of one of those
partial paths to create a real path; or we can use those partial paths
to build more partial paths.  The current patch handles only the
simplest case: we can build a partial path for an appendrel by
appending a partial path for each member rel.  But eventually I hope
to handle joinrels this way as well: you can join a partial path with
an ordinary path to form a partial path for the joinrel.

This requires some way of figuring out how many workers to request for
the append-path, so this patch also adds a parallel_degree field to
the path object, which is 0 for non-parallel things and the number of
workers that the path believes to be ideal otherwise.  Right now, it
just percolates the highest worker count of any child up to the
appendrel, which might not be right, especially once the append node
knows how to balance workers, but it seems like a reasonable place to
start.

> Type-A) It can be performed on background worker context and
>picked up by multiple worker processes concurrently.
>   (e.g: Parallel SeqScan)
> Type-B) It can be performed on background worker context but
>   cannot be picked up by multiple worker processes.
>   (e.g: non-parallel aware node)
> Type-C) It should not be performed on background worker context.
>(e.g: plan/path node with volatile functions)

At the time that we're forming an AppendPath, we can identify what
you're calling type-A paths very easily: they're the ones that are
coming from the partial_pathlist.  We can distinguish between type-B
and type-C paths coming from the childrel's pathlist based on the
childrel's consider_parallel flag: if it's false, it's type-C, else
type-B.  At some point, we might need a per-path flag to distinguish
cases where a particular path is type-C even though some other plan
for that relation might be type-B.  But right now that case doesn't
arise.

Now, of course, it's not good enough to have this information
available when we're generating the AppendPath; it has to survive
until execution time.  But that doesn't mean the paths need to be
self-identifying.  We could, of course, decide to make them so, and
maybe that's the best design, but I'm thinking about another approach:
suppose the append node itself, instead of having one list of child
plans, has a list of type-A plans, a list of type-B plans, and a list
of type-C plans.  This actually seems quite convenient, because at
execution time, you presumably want the leader to prioritize type-C
plans over any of the others, since it's the only one that can execute
them, and the workers to prioritize type-B plans, since they can only
take one worker each and are thus prone to be the limiting factor on
finishing the whole Append.  Having the plans divided in advance
between these three lists (say, restricted_plans, safe_plans,
parallel_plans) makes that easy to implement.
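
To sketch what I mean, the node might carry something like this (field names
invented purely for illustration, not taken from the patch):

typedef struct Append
{
    Plan    plan;
    List   *restricted_plans;   /* type-C: only the leader may run these */
    List   *safe_plans;         /* type-B: a worker may run one, but each
                                 * plan can be picked up by only one process */
    List   *parallel_plans;     /* type-A: partial plans, shared among all
                                 * participating processes */
} Append;

Each participant would then pick from the list appropriate to it, with the
leader preferring restricted_plans and the workers preferring safe_plans, as
described above.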

Incidentally, I think it's subtly wrong to think of the parallel_aware
flag as telling you whether the plan can absorb multiple workers.
That's not really what it's for.  It's to tell you whether the plan is
doing *something* parallel aware - that is, whether its Estimate,
InitializeDSM, and InitializeWorker callbacks should do anything.  For
SeqScan, flipping parallel_aware actually does split the input among
all the workers, but for Append it's probably just load balancing, and
for other nodes it might be something else again.  The term I'm using
to indicate a path/plan that returns only a subset of the results in
each worker is "partial".  Whether or not a path is partial is, in the
design embodied in this patch, indicated both by whether
path->parallel_degree > 0 and whether the path is in rel->pathlist or
rel->partial_pathlist.

-- 
Robert Haas

[HACKERS] check for interrupts in set_rtable_names

2015-11-13 Thread Jeff Janes
Someone sent my server a deranged query, it tripped my
auto_explain.log_min_duration setting, that hit some kind of
pathological case while assigning aliases, and now it sits
uninterruptibly in set_rtable_names for hours.

Is there any reason we can't check for interrupts in set_rtable_names,
like the attached?

I've tried it on a deranged query and it seems to do the job.

A deranged query:

perl -le 'print "explain"; print "select * from pgbench_accounts where
aid=5 union" foreach 1..5000; print "select * from pgbench_accounts
where aid=5 ;"'|psql

Cheers,

Jeff


rtable_names_interrupts.patch
Description: Binary data



Re: [HACKERS] Parallel Seq Scan

2015-11-13 Thread Amit Kapila
On Fri, Nov 13, 2015 at 9:16 PM, Thom Brown  wrote:
>
> On 13 November 2015 at 15:22, Amit Kapila  wrote:
> >
> > There will be hardly any difference in the nodes for each worker, and it
> > could make for a very long plan with a large number of workers.  What kind
> > of additional information do you want that can't be shown in the current
> > format?
>
> For explain plans, not that useful, but it's useful to see how long
> each worker took for explain analyse.
>

The statistics related to buffers, timing and, in fact, rows filtered will be
different for each worker, so it sounds sensible to me to display separate
plan info for each worker, or at least display the same in verbose or some
other mode, and then display aggregated information at the Gather node.  The
only point that needs more thought is that parallel plans will look big for
a large number of workers.  I think this will need somewhat more substantial
changes than what is done currently for parallel seq scan, so it is better
if others also share their opinion about this form of display of information
for parallel queries.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] check for interrupts in set_rtable_names

2015-11-13 Thread Jeff Janes
On Fri, Nov 13, 2015 at 3:13 PM, Tom Lane  wrote:
> Jeff Janes  writes:
>> Someone sent my server a deranged query, it tripped my
>> auto_explain.log_min_duration setting, that hit some kind of
>> pathological case while assigning aliases, and now it sits
>> uninterruptibly in set_rtable_names for hours.
>
>> Is there any reason we can't check for interrupts in set_rtable_names,
>> like the attached?
>
> There's probably no reason not to do that, but I'd be much more interested
> in eliminating the slowness to begin with ...

I was thinking about that as well, but I don't think that would be
back-patchable, at least not the way I was envisioning it.

I was thinking of detecting bad cases (had to count to over 10 before
finding a novel name, more than 10 times) and then switching from an
object-local count, to a global count, for the numbers to add to the
name.  But that would be a behavior change under some conditions.

I think the code says it clearer than prose, so crude first attempt at
that attached.
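
In other words, something roughly like this (crude pseudo-C of the idea
only, not the attached patch, and the threshold of 10 is arbitrary):

#include "postgres.h"

static int  global_suffix = 0;  /* counter shared across all names */
static int  bad_runs = 0;       /* how many names needed a long search */

static char *
make_unique_name(const char *refname, bool (*name_in_use) (const char *name))
{
    char        buf[NAMEDATALEN];
    int         i;

    if (bad_runs <= 10)
    {
        /* normal case: per-name suffixes refname_1, refname_2, ... */
        for (i = 1;; i++)
        {
            snprintf(buf, sizeof(buf), "%s_%d", refname, i);
            if (!name_in_use(buf))
                break;
        }
        if (i > 10)
            bad_runs++;         /* that search took too long; remember it */
        return pstrdup(buf);
    }

    /* pathological case: one ever-increasing counter for every name */
    do
    {
        snprintf(buf, sizeof(buf), "%s_%d", refname, ++global_suffix);
    } while (name_in_use(buf));

    return pstrdup(buf);
}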

Cheers,

Jeff


rtable_names_faster.patch
Description: Binary data



Re: [HACKERS] Inaccurate results from numeric ln(), log(), exp() and pow()

2015-11-13 Thread Dean Rasheed
On 13 November 2015 at 23:10, Tom Lane  wrote:
> One more thing: the approach you used in power_var() of doing a whole
> separate exp * ln(base) calculation to approximate the result weight
> seems mighty expensive, even if it is done at minimal precision.
> Couldn't we get a good-enough approximation using basically
> numericvar_to_double_no_overflow(exp) * estimate_ln_weight(base) ?
>

I can't see a way to make that work reliably. It would need to be
10^estimate_ln_weight(base) and the problem is that both exp and
ln_weight could be too big to fit in double variables, and become
HUGE_VAL, losing all precision.

An interesting example is the limit of (1+1/x)^x as x approaches
infinity which is e (the base of natural logarithms), so in that case
both the exponent and ln_weight could be arbitrarily big (well too big
for doubles anyway). For example (1+1/1.2e+500)^(1.2e500) =
2.7182818284...
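
To put rough numbers on that example:

    x         = 1.2e500
    base      = 1 + 1/x,  exponent = x
    result    = (1 + 1/x)^x ~= e = 2.718281828...   =>  result weight 0

    ln(base)  ~= 1/x ~= 8.3e-501   =>  estimate_ln_weight(base) ~= -501

In doubles the exponent overflows to HUGE_VAL and 10^(-501) underflows to
zero, so their product tells you nothing, even though the true result has a
perfectly ordinary weight.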

Regards,
Dean




Re: [HACKERS] Inaccurate results from numeric ln(), log(), exp() and pow()

2015-11-13 Thread Dean Rasheed
On 14 November 2015 at 00:16, Dean Rasheed  wrote:
> I can't see a way to make that work reliably. It would need to be
> 10^estimate_ln_weight(base) and the problem is that both exp and
> ln_weight could be too big to fit in double variables, and become
> HUGE_VAL, losing all precision.

Of course I meant 10^ln_weight could be too big to fit in a double.

Regards,
Dean




Re: [HACKERS] Inaccurate results from numeric ln(), log(), exp() and pow()

2015-11-13 Thread Tom Lane
Dean Rasheed  writes:
> On 14 November 2015 at 00:16, Dean Rasheed  wrote:
>> I can't see a way to make that work reliably. It would need to be
>> 10^estimate_ln_weight(base) and the problem is that both exp and
>> ln_weight could be too big to fit in double variables, and become
>> HUGE_VAL, losing all precision.

> Of course I meant 10^ln_weight could be too big to fit in a double.

Drat, that's the second time in this review that I've confused
ln_weight(x) with ln(x).  Time to call it a day ...

regards, tom lane




Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive

2015-11-13 Thread Amit Kapila
On Wed, Sep 16, 2015 at 11:22 PM, Robert Haas  wrote:
> On Wed, Sep 16, 2015 at 12:29 PM, Alexander Korotkov
>  wrote:
>
> >> I think it's reasonable to consider reporting this data in the PGPROC
> >> using a 4-byte integer rather than reporting it through a singe byte
> >> in the backend status structure.  I believe that addresses the
> >> concerns about reporting from auxiliary processes, and it also allows
> >> a little more data to be reported.  For anything in excess of that, I
> >> think we should think rather harder.  Most likely, such addition
> >> detail should be reported only for certain types of wait events, or on
> >> a delay, or something like that, so that the core mechanism remains
> >> really, really fast.
> >
> > That sounds reasonable. There are many pending questions, but it seems
> > like a step forward to me.
>
> Great, let's do it.  I think we should probably do the work to
> separate the non-individual lwlocks into tranches first, though.
>

One thing that occurred to me in this context is that if we store the wait
event information in PGPROC, then can we think of providing the info
about wait events in a separate view pg_stat_waits (or pg_stat_wait_info or
any other better name) where we can display wait information about
all processes rather than only backends?  This will avoid the confusion
about breaking the backward compatibility for the current 'waiting' column
in pg_stat_activity.

pg_stat_waits can have columns:
pid  - Process Id
wait_class_name - Name of the wait class
wait class_event  - name of the wait event

We can extend it later with the information about timing for wait event.

Also, if we follow this approach, I think we don't need to store this
information in PgBackendStatus.
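
Just to illustrate the kind of thing I have in mind for the 4-byte value
stored in PGPROC (the layout and names here are purely illustrative, not
from any patch):

typedef uint32 WaitEventInfo;   /* the single word published in PGPROC */

#define WAIT_CLASS_SHIFT    24
#define WAIT_EVENT_MASK     0x00FFFFFF

/* pack a class id and an event id into one word */
#define MakeWaitEventInfo(classId, eventId) \
    (((uint32) (classId) << WAIT_CLASS_SHIFT) | \
     ((uint32) (eventId) & WAIT_EVENT_MASK))

/* and unpack it again when building the pg_stat_waits view */
#define WaitEventClass(w)   ((w) >> WAIT_CLASS_SHIFT)
#define WaitEventId(w)      ((w) & WAIT_EVENT_MASK)

A process would publish its current wait with a single store of such a
value, and the view would decode the two halves into wait_class_name and
wait_class_event for every process, not just backends.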


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


[HACKERS] pg_rewind and replication slots

2015-11-13 Thread Vladimir Borodin
Hi.

Should pg_rewind ignore the pg_replslot dir entirely, as pg_basebackup does,
for example?

--
May the force be with you…
https://simply.name



Re: [HACKERS] Parallel Seq Scan

2015-11-13 Thread Thom Brown
On 13 November 2015 at 13:38, Amit Kapila  wrote:
> On Wed, Nov 11, 2015 at 11:40 PM, Pavel Stehule 
> wrote:
>>
>>
>> yes - another slightly unclean thing in EXPLAIN is the number of workers.
>> If I understand the behaviour, the query is processed by two processes
>> even if the number of workers shown in the explain is one.
>>
>
> You are right and I think that is current working model of Gather
> node which seems okay. I think the more serious thing here
> is that there is possibility that Explain Analyze can show the
> number of workers as more than actual workers working for Gather
> node.  We have already discussed that Explain Analyze should show
> the actual number of workers used in query execution, patch for
> the same is still pending.

This may have already been discussed before, but in a verbose output,
would it be possible to see the nodes for each worker?

e.g.

# explain (analyse, buffers, timing, verbose, costs) select count(*)
from js where content->'tags'->>'title' like '%de%';
                                           QUERY PLAN
------------------------------------------------------------------------------------------------
 Aggregate  (cost=105557.59..105557.60 rows=1 width=0) (actual time=400.752..400.752 rows=1 loops=1)
   Output: count(*)
   Buffers: shared hit=175333
   ->  Gather  (cost=1000.00..104931.04 rows=250621 width=0) (actual time=400.748..400.748 rows=0 loops=1)
         Output: content
         Number of Workers: 2
         Buffers: shared hit=175333
         ->  Parallel Seq Scan on public.js  (cost=0.00..39434.47 rows=125310 width=0) (actual time=182.256..398.14 rows=0 loops=1)
               Output: content
               Filter: (((js.content -> 'tags'::text) ->> 'title'::text) ~~ '%de%'::text)
               Rows Removed by Filter: 626486
               Buffers: shared hit=87666
         ->  Parallel Seq Scan on public.js  (cost=0.00..39434.47 rows=1253101 width=0) (actual time=214.11..325.31 rows=0 loops=1)
               Output: content
               Filter: (((js.content -> 'tags'::text) ->> 'title'::text) ~~ '%de%'::text)
               Rows Removed by Filter: 6264867
               Buffers: shared hit=876667
 Planning time: 0.085 ms
 Execution time: 414.713 ms
(14 rows)

And perhaps associated PIDs?

Thom




Re: [HACKERS] Parallel Seq Scan

2015-11-13 Thread Thom Brown
On 13 November 2015 at 03:39, Amit Kapila  wrote:
> On Thu, Nov 12, 2015 at 9:05 PM, Thom Brown  wrote:
>>
>> On 12 November 2015 at 15:23, Amit Kapila  wrote:
>> > On Wed, Nov 11, 2015 at 11:29 PM, Pavel Stehule
>> > 
>> > wrote:
>> >>
>> >> Hi
>> >>
>> >> I have a first query
>> >>
>> >> I looked on EXPLAIN ANALYZE output and the numbers of filtered rows are
>> >> differen
>> >>
>> >
>> > Thanks for the report.  The reason for this problem is that
>> > instrumentation
>> > information from workers is getting aggregated multiple times.  In
>> > ExecShutdownGatherWorkers(), we call ExecParallelFinish where it
>> > will wait for workers to finish and then accumulate stats from workers.
>> > Now ExecShutdownGatherWorkers() could be called multiple times
>> > (once we read all tuples from workers, at end of node) and it should be
>> > ensured that repeated calls should not try to redo the work done by
>> > first
>> > call.
>> > The same is ensured for tuplequeues, but not for parallel executor info.
>> > I think we can safely assume that we need to call ExecParallelFinish()
>> > only
>> > when there are workers started by the Gathers node, so on those lines
>> > attached patch should fix the problem.
>>
>> That fixes the count issue for me, although not the number of buffers
>> hit,
>>
>
> The number of shared buffers hit could be different across different runs
> because the read sequence of parallel workers can't be guaranteed, also
> I don't think same is even guaranteed for Seq Scan node, the other
> operations
> in parallel could lead to different number, however the actual problem was
> that in one of the plans shown by you [1], the Buffers hit at Gather node
> (175288) is lesser than the Buffers hit at Parallel Seq Scan node (241201).
> Do you still (after applying above patch) see that Gather node is showing
> lesser hit buffers than Parallel Seq Scan node?

Hmm... that's odd, I'm not seeing the problem now, so maybe I'm mistaken there.

> [1]
>  # explain (analyse, buffers, timing, verbose, costs) select count(*)
> from js where content->'tags'->>'title' like '%design%';
>  QUERY
> PLAN
> 
>  Aggregate  (cost=132489.34..132489.35 rows=1 width=0) (actual
> time=382.987..382.987 rows=1 loops=1)
>Output: count(*)
>Buffers: shared hit=175288
>->  Gather  (cost=1000.00..132488.34 rows=401 width=0) (actual
> time=382.983..382.983 rows=0 loops=1)
>  Output: content
>  Number of Workers: 1
>  Buffers: shared hit=175288
>  ->  Parallel Seq Scan on public.js  (cost=0.00..131448.24
> rows=401 width=0) (actual time=379.407..1141.437 rows=0 loops=1)
>Output: content
>Filter: (((js.content -> 'tags'::text) ->>
> 'title'::text) ~~ '%design%'::text)
>Rows Removed by Filter: 1724810
>Buffers: shared hit=241201
>  Planning time: 0.104 ms
>  Execution time: 403.045 ms
> (14 rows)
>
> Time: 403.596 ms
>
>> or the actual time taken.
>>
>
> Exactly what time you are referring here, Execution Time or actual time
> shown on Parallel Seq Scan node and what problem do you see with
> the reported time?

I'm referring to the Parallel Seq Scan actual time, showing
"379.407..1141.437" with 1 worker, but the total execution time shows
403.045.  If one worker is taking over a second, how come the whole
query was less than half a second?

Thom




Re: [HACKERS] eXtensible Transaction Manager API

2015-11-13 Thread Michael Paquier
On Thu, Nov 12, 2015 at 11:23 PM, Robert Haas  wrote:
> On Mon, Nov 9, 2015 at 2:47 PM, Simon Riggs  wrote:
>> On 9 November 2015 at 18:46, Robert Haas  wrote:
>>> One point I'd like to mention is that it's absolutely critical to
>>> design this in a way that minimizes network roundtrips without
>>> compromising correctness.  XC's GTM proxy suggests that they failed to
>>> do that.  I think we really need to look at what's going to be on the
>>> other sides of the proposed APIs and think about whether it's going to
>>> be possible to have a strong local caching layer that keeps network
>>> roundtrips to a minimum.  We should consider whether the need for such
>>> a caching layer has any impact on what the APIs should look like.
>> You mean the caching layer that already exists in XL/XC?
>
> I don't think that's what I mean, no.  If you have an external GTM
> Proxy, then you have missed a trick, because whatever caching it does
> could be done better inside the process that sent the request to the
> proxy.

Definitely. Having a single communication channel between the backends
and the centralized server that communicates with grouped messages
would clearly help enhance the scalability of the system, though
this is not something the DTM should try to solve IMO.

>>> For example, consider a 10-node cluster where each node has 32 cores
>>> and 32 clients, and each client is running lots of short-running SQL
>>> statements.  The demand for snapshots will be intense.  If every
>>> backend separately requests a snapshot for every SQL statement from
>>> the coordinator, that's probably going to be terrible.  We can make it
>>> the problem of the stuff behind the DTM API to figure out a way to
>>> avoid that, but maybe that's going to result in every DTM needing to
>>> solve the same problems.
>>
>> The whole purpose of that XTM API is to allow different solutions for that
>> to be created. Konstantin has made a very good case for such an API to
>> exist, based around 3 markedly different approaches.
>
> I'm not arguing with that.
>
>> Whether we allow the API into core to be accessible via extensions is a
>> different issue, but it looks fine for its purpose.
>
> I'm not attacking the API.  I'm trying to have a discussion about the
> important design issues in this area.

FWIW, I just looked at the wiki page regarding the DTM, and the set of
routines is actually very close to what XC/XL's GTM is doing. Where
XC/XL directly forked the code of Postgres to redirect to the GTM when
fetching a snapshot, TXID, whatever, DTM is using a set of generic
methods to achieve this purpose and replace the in-core calls with a
set of custom functions to relocate to the external instance in charge
of distributing the transaction data. Other differences are that GTM
is extended for global timestamps and global sequence values, but
that's not really related to ACID.

This just reminded me of the following message, where I wondered about the
possibility of adding hooks in those hot code paths...
http://www.postgresql.org/message-id/cab7npqtdjf-58wuf-xz01nkj7wf0e+eukggqhd0igvsod4h...@mail.gmail.com
-- 
Michael




Re: [HACKERS] pg_rewind and replication slots

2015-11-13 Thread Michael Paquier
On Fri, Nov 13, 2015 at 7:41 PM, Vladimir Borodin  wrote:
> Should pg_rewind ignore pg_replslot dir at all? As it does pg_basebackup,
> for example.

To keep the code of pg_rewind simple, the decision has been made to
make pg_rewind copy everything; the necessary post-cleanup or pre-process
saving (for example of the target node's postgresql.conf, which would
otherwise get overwritten) is the responsibility of the upper application
layer calling pg_rewind.
-- 
Michael




Re: [HACKERS] Parallel Seq Scan

2015-11-13 Thread Amit Kapila
On Fri, Nov 13, 2015 at 10:56 AM, Robert Haas  wrote:
>
> On Thu, Nov 12, 2015 at 10:39 PM, Amit Kapila 
wrote:
> > The number of shared buffers hit could be different across different
runs
> > because the read sequence of parallel workers can't be guaranteed, also
> > I don't think same is even guaranteed for Seq Scan node,
>
> The number of hits could be different.  However, it seems like any
> sequential scan, parallel or not, should have a number of accesses
> (hit + read) equal to the size of the relation.  Not sure if that's
> what is happening here.
>

With the patch provided above to fix the issue reported by Pavel, that is
the behaviour, but I think there are a few more things which we might
want to consider; refer to the plan below:

Total pages in table

postgres=# select relname,relpages from pg_class where relname like '%t2%';
 relname | relpages
---------+----------
 t2      |     5406
(1 row)


Parallel Plan
-------------
postgres=# explain (analyze,buffers,timing) select count(*) from t2 where c1 % 10 = 0;
                                        QUERY PLAN
------------------------------------------------------------------------------------------
 Aggregate  (cost=8174.90..8174.91 rows=1 width=0) (actual time=1055.294..1055.294 rows=1 loops=1)
   Buffers: shared hit=446 read=5054
   ->  Gather  (cost=0.00..8162.40 rows=5000 width=0) (actual time=79.787..959.651 rows=10 loops=1)
         Number of Workers: 2
         Buffers: shared hit=446 read=5054
         ->  Parallel Seq Scan on t2  (cost=0.00..8162.40 rows=5000 width=0) (actual time=30.771..2518.844 rows=10 loops=1)
               Filter: ((c1 % 10) = 0)
               Rows Removed by Filter: 90
               Buffers: shared hit=352 read=5054
 Planning time: 0.170 ms
 Execution time: 1059.400 ms
(11 rows)

Let's focus on Buffers and actual time in the above plan:

Buffers - At the Parallel Seq Scan node, it shows a total of 5406
(352+5054) buffers, which tallies with what is expected.  However, at
the Gather node it shows 5500 (446+5054), and the reason is that we
accumulate the overall buffer usage of each parallel worker, which
includes the startup of the node as well (refer to ParallelQueryMain()),
and that gets counted towards the buffer calculation of the Gather node.
The theory behind collecting the overall buffer usage for parallel
execution was that we need it for pg_stat_statements, where the stats
are accumulated for the overall execution rather than on a node-by-node
basis (refer to the queryDesc->totaltime usage in
standard_ExecutorRun()).
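
Roughly, the accumulation works like the sketch below; the names are
hypothetical stand-ins for the real parallel-query/instrumentation code,
and the struct is a cut-down version of PostgreSQL's BufferUsage:

/* Rough sketch of the accumulation described above; illustrative only. */
typedef struct WorkerBufUsage
{
    long        shared_blks_hit;
    long        shared_blks_read;
} WorkerBufUsage;

/*
 * Each worker tallies everything it did -- including buffers touched while
 * starting up its copy of the plan -- and publishes the totals in shared
 * memory before exiting.  The leader then folds every worker's totals into
 * its own counters while the Gather node is the active node, which is why
 * Gather reports more buffers (5500) than the Parallel Seq Scan node (5406).
 */
static void
leader_accumulate(WorkerBufUsage *leader,
                  const WorkerBufUsage *workers, int nworkers)
{
    int         i;

    for (i = 0; i < nworkers; i++)
    {
        leader->shared_blks_hit += workers[i].shared_blks_hit;
        leader->shared_blks_read += workers[i].shared_blks_read;
    }
}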
I think here we need to decide what is the right value to display at
Gather node:
1. Display the same number of buffers at Gather node as at Parallel
Seq Scan node.
2. Display the number of buffers at Parallel Seq Scan node plus the
additional buffers used by parallel workers for miscellaneous work
like ExecutorStart(), etc.
3. Don't account for buffers used for parallel workers.
4. Anything better?

Also, in conjunction with the above, we need to decide what should be
accounted for in pg_stat_statements.

actual_time -
Actual time at Gather node = 79.787..959.651
Actual time at Parallel Seq Scan node = 30.771..2518.844

The time at the Parallel Seq Scan node is greater than the time at the
Gather node because the time taken by the parallel workers is also
accumulated into the Parallel Seq Scan node, whereas some of it doesn't
get accounted for at the Gather node.  This could be confusing for
users, because the time displayed at the Parallel Seq Scan node is
effectively time_taken_by_worker-1 + time_taken_by_worker-2 + ...
That sum can be greater than the actual time taken by the query, because
the workers execute in parallel but we account for their time as if each
were executing serially.
I think the time for fetching the tuples from workers is already
accounted for at the Gather node, so maybe for the Parallel Seq Scan
node we can omit adding the time for each of the parallel workers.
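
To see why the summed figure can exceed the query's elapsed time, here is
a toy illustration; the per-participant split is made up, only the
~2518 ms total from the plan above is real:

#include <stdio.h>

int
main(void)
{
    /* Hypothetical split across leader + 2 workers, running concurrently. */
    double      busy_ms[] = {840.0, 830.0, 848.844};
    double      accumulated = 0.0;
    double      wall_clock = 0.0;
    int         i;

    for (i = 0; i < 3; i++)
    {
        accumulated += busy_ms[i];      /* what the node currently reports */
        if (busy_ms[i] > wall_clock)
            wall_clock = busy_ms[i];    /* elapsed time is roughly the slowest participant */
    }

    printf("accumulated = %.3f ms, wall clock ~= %.3f ms\n",
           accumulated, wall_clock);
    return 0;
}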

Thoughts?


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Parallel Seq Scan

2015-11-13 Thread Amit Kapila
On Wed, Nov 11, 2015 at 11:40 PM, Pavel Stehule  wrote:

>
> yes - another slightly unclear thing in EXPLAIN is the number of workers.
> If I understand the behaviour, the query is processed by two processes
> when the number of workers shown in the explain is one.
>
>
You are right, and I think that is the current working model of the
Gather node, which seems okay. I think the more serious thing here is
that there is a possibility that Explain Analyze can show the number of
workers as more than the number of workers actually working for the
Gather node.  We have already discussed that Explain Analyze should show
the actual number of workers used in query execution; the patch for the
same is still pending.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] eXtensible Transaction Manager API

2015-11-13 Thread Michael Paquier
On Tue, Nov 10, 2015 at 3:46 AM, Robert Haas  wrote:
> On Sun, Nov 8, 2015 at 6:35 PM, Michael Paquier
>  wrote:
>> Sure. Now imagine that the pg_twophase entry is corrupted for this
>> transaction on one node. This would trigger a PANIC on it, and
>> transaction would not be committed everywhere.
>
> If the database is corrupted, there's no way to guarantee that
> anything works as planned.  This is like criticizing
> somebody's disaster recovery plan on the basis that it will be
> inadequate if the entire planet earth is destroyed.

There could just as well be FS, OS, or network problems... To come back
to the point, my point is simply that I found surprising the sentence
from Konstantin upthread saying that if the commit fails on some of the
nodes we should roll back the prepared transaction on all nodes. In the
example given, in the phase after calling dtm_end_prepare, say we
perform COMMIT PREPARED correctly on node 1, but then fail on node 2
because a meteor has hit the server; it seems that we cannot roll back.
Instead we had better roll in a backup and make sure that the
transaction gets committed. How would you roll back a transaction
already committed on node 1? But perhaps I missed something...

>> I am aware of the fact
>> that by definition PREPARE TRANSACTION ensures that a transaction will
>> be committed with COMMIT PREPARED, just trying to see any corner cases
>> with the approach proposed. The DTM approach is actually rather close
>> to what a GTM in Postgres-XC does :)
>
> Yes.  I think that we should try to learn as much as possible from the
> XC experience, but that doesn't mean we should incorporate XC's fuzzy
> thinking about 2PC into PG.  We should not.

Yep.

> One point I'd like to mention is that it's absolutely critical to
> design this in a way that minimizes network roundtrips without
> compromising correctness.  XC's GTM proxy suggests that they failed to
> do that.  I think we really need to look at what's going to be on the
> other sides of the proposed APIs and think about whether it's going to
> be possible to have a strong local caching layer that keeps network
> roundtrips to a minimum.  We should consider whether the need for such
> a caching layer has any impact on what the APIs should look like.

At the time, the number of round trips needed, particularly for READ
COMMITTED transactions that need a new snapshot for each query, was a
real performance killer. We used DBT-1 (TPC-W), which is less OLTP-like
than DBT-2 (TPC-C); still, with DBT-1 the scalability limit was quickly
reached with 10-20 nodes.
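
For what it's worth, one conceivable shape for the "local caching layer"
mentioned above is sketched here; this is purely illustrative, with
invented names, and not any API that has been proposed:

/* Per-backend cache that avoids asking the coordinator for a snapshot on
 * every statement; hypothetical sketch only. */
typedef struct GlobalSnapshot
{
    unsigned long epoch;        /* coordinator's commit-ordering counter */
    /* visibility data would live here */
} GlobalSnapshot;

static GlobalSnapshot cached_snapshot;
static int            cache_valid = 0;

/* Placeholders: a cheap shared-memory read vs. an expensive network call. */
extern unsigned long  read_published_epoch(void);
extern GlobalSnapshot fetch_snapshot_from_coordinator(void);

static GlobalSnapshot
get_global_snapshot(void)
{
    /* Refetch only when the coordinator's published epoch has advanced. */
    if (!cache_valid || cached_snapshot.epoch != read_published_epoch())
    {
        cached_snapshot = fetch_snapshot_from_coordinator();
        cache_valid = 1;
    }
    return cached_snapshot;
}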

> For example, consider a 10-node cluster where each node has 32 cores
> and 32 clients, and each client is running lots of short-running SQL
> statements.  The demand for snapshots will be intense.  If every
> backend separately requests a snapshot for every SQL statement from
> the coordinator, that's probably going to be terrible.  We can make it
> the problem of the stuff behind the DTM API to figure out a way to
> avoid that, but maybe that's going to result in every DTM needing to
> solve the same problems.

This recalls a couple of things, though in 2009 I was not playing with
servers of this scale.

> Another thing that I think we need to consider is fault-tolerance.
> For example, suppose that transaction commit and snapshot coordination
> services are being provided by a central server which keeps track of
> the global commit ordering.  When that server gets hit by a freak bolt
> of lightning and melted into a heap of slag, somebody else needs to
> take over.  Again, this would live below the API proposed here, but I
> think it really deserves some thought before we get too far down the
> path.  XC didn't start thinking about how to add fault-tolerance until
> quite late in the project, I think, and the same could be said of
> PostgreSQL itself: some newer systems have easier-to-use fault
> tolerance mechanisms because it was designed in from the beginning.
> Distributed systems by nature need high availability to a far greater
> degree than single systems, because when there are more nodes, node
> failures are more frequent.

Your recollection of the matter is right. In the case of XC, the SPOF
that is the GTM has been made somewhat more stable with the creation of
a GTM standby, though that did not solve the scalability limit of having
a centralized snapshot facility. It actually increased the load on the
whole system, particularly for short transactions. Alea jacta est.
-- 
Michael




Re: [HACKERS] Parallel Seq Scan

2015-11-13 Thread Amit Kapila
On Fri, Nov 13, 2015 at 6:17 PM, Thom Brown  wrote:
>
> >
> > The number of shared buffers hit could be different across different runs
> > because the read sequence of parallel workers can't be guaranteed, also
> > I don't think the same is even guaranteed for a Seq Scan node; the other
> > operations in parallel could lead to different numbers. However, the
> > actual problem was that in one of the plans shown by you [1], the Buffers
> > hit at the Gather node (175288) is lower than the Buffers hit at the
> > Parallel Seq Scan node (241201).  Do you still (after applying the above
> > patch) see the Gather node showing fewer hit buffers than the Parallel
> > Seq Scan node?
>
> Hmm... that's odd, I'm not seeing the problem now, so maybe I'm mistaken
> there.
>

Thanks for confirming the same.

> >
> >> or the actual time taken.
> >>
> >
> > Exactly which time are you referring to here, the Execution Time or the
> > actual time shown on the Parallel Seq Scan node, and what problem do you
> > see with the reported time?
>
> I'm referring to the Parallel Seq Scan actual time, showing
> "379.407..1141.437" with 1 worker, but the total execution time shows
> 403.045.  If one worker is taking over a second, how come the whole
> query was less than half a second?
>

Yeah, this is possible due to the way time is currently accumulated; see
the mail I sent just before this one.  I think we might need to do
something, else it could be confusing for users.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com