Re: [HACKERS] Partition-wise aggregation/grouping

2018-01-01 Thread Jeevan Chalke
On Tue, Dec 12, 2017 at 3:43 PM, Ashutosh Bapat <
ashutosh.ba...@enterprisedb.com> wrote:

> Here are review comments for 0009
> Only full aggregation is pushed on the remote server.
>
> I think we can live with that for a while but we need to be able to push
> down
> partial aggregates to the foreign server. I agree that it needs some
> infrastructure to serialized and deserialize the partial aggregate values,
> support unpartitioned aggregation first and then work on partitioned
> aggregates. That is definitely a separate piece of work.
>

Yep.


>
> +CREATE TABLE pagg_tab_p3 (a int, b int, c text);
>
> Like partition-wise join testcases please use LIKE so that it's easy to
> change
> the table schema if required.
>

Done.
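
For reference, the definitions now follow the pattern of the partition-wise
join tests, roughly like this (a sketch; the parent's exact partitioning
strategy is my assumption based on the INSERT ranges below):

CREATE TABLE pagg_tab (a int, b int, c text) PARTITION BY RANGE (a);  -- assumed parent
CREATE TABLE pagg_tab_p1 (LIKE pagg_tab);
CREATE TABLE pagg_tab_p2 (LIKE pagg_tab);
CREATE TABLE pagg_tab_p3 (LIKE pagg_tab);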


>
> +INSERT INTO pagg_tab_p1 SELECT i % 30, i % 50, to_char(i/30,
> 'FM') FROM generate_series(1, 3000) i WHERE (i % 30) < 10;
> +INSERT INTO pagg_tab_p2 SELECT i % 30, i % 50, to_char(i/30,
> 'FM') FROM generate_series(1, 3000) i WHERE (i % 30) < 20 and (i %
> 30) >= 10;
> +INSERT INTO pagg_tab_p3 SELECT i % 30, i % 50, to_char(i/30,
> 'FM') FROM generate_series(1, 3000) i WHERE (i % 30) < 30 and (i %
> 30) >= 20;
>
> We have to do this because INSERT tuple routing to a foreign partition is
> not
> supported right now.


Yes.


> Somebody has to remember to change this to a single
> statement once that's done.
>

I don't know how we can keep track of it.
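
For what it's worth, once INSERT tuple routing to foreign partitions is
supported, the three per-partition loads above could presumably collapse into
a single routed statement through the parent, along these lines (keeping the
quoted expressions exactly as they appear above):

INSERT INTO pagg_tab
  SELECT i % 30, i % 50, to_char(i/30, 'FM') FROM generate_series(1, 3000) i;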


>
> +ANALYZE fpagg_tab_p1;
> +ANALYZE fpagg_tab_p2;
> +ANALYZE fpagg_tab_p3;
>
> I thought this is not needed. When you ANALYZE the partitioned table, it
> would
> analyze the partitions as well. But I see that partition-wise join is also
> ANALYZING the foreign partitions separately. When I ran ANALYZE on a
> partitioned table with foreign partitions, statistics for only the local
> tables
> (partitioned and partitions) was updated. Of course this is separate work,
> but
> probably needs to be fixed.
>

Hmm.
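
To spell out why the explicit per-partition statements stay for now (per the
observation above, analyzing the parent does not refresh the foreign
partitions' statistics):

ANALYZE pagg_tab;       -- does not update stats for the foreign partitions
ANALYZE fpagg_tab_p1;   -- so each foreign partition is analyzed explicitly
ANALYZE fpagg_tab_p2;
ANALYZE fpagg_tab_p3;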


>
> +-- When GROUP BY clause matches with PARTITION KEY.
> +-- Plan when partition-wise-agg is disabled
>
> s/when/with/
>
> +-- Plan when partition-wise-agg is enabled
>
> s/when/with/
>

Done.


>
> +   ->  Append
>
> Just like ForeignScan node's Relations tell what kind of ForeignScan this
> is,
> may be we should annotate Append to tell whether the children are joins,
> aggregates or relation scans. That might be helpful. Of course as another
> patch.
>

OK.


>
> +SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING
> avg(b) < 25 ORDER BY 1;
> + a  | sum  | min | count
> ++--+-+---
> +  0 | 2000 |   0 |   100
> +  1 | 2100 |   1 |   100
> [ ... clipped ...]
> + 23 | 2300 |   3 |   100
> + 24 | 2400 |   4 |   100
> +(15 rows)
>
> May be we want to reduce the number of rows to a few by using a stricter
> HAVING
> clause?
>

Done.
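
The test keeps the same query shape with a tighter filter, along these lines
(the exact threshold in the committed test may differ):

SELECT a, sum(b), min(b), count(*) FROM pagg_tab
  GROUP BY a HAVING avg(b) < 22 ORDER BY 1;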


>
> +
> +-- When GROUP BY clause not matches with PARTITION KEY.
>
> ... clause does not match ...
>

Done.


>
> +EXPLAIN (VERBOSE, COSTS OFF)
> +SELECT b, avg(a), max(a), count(*) FROM pagg_tab GROUP BY b HAVING
> sum(a) < 800 ORDER BY 1;
> +
> QUERY PLAN
> +---
> 
> --
> + Sort
> +   Output: fpagg_tab_p1.b, (avg(fpagg_tab_p1.a)),
> (max(fpagg_tab_p1.a)), (count(*))
> +   ->  Partial HashAggregate
> [ ... clipped ... ]
> + Output: fpagg_tab_p3.b, PARTIAL
> avg(fpagg_tab_p3.a), PARTIAL max(fpagg_tab_p3.a), PARTIAL count(*),
> PARTIAL sum(fpagg_tab_p3.a)
> + Group Key: fpagg_tab_p3.b
> + ->  Foreign Scan on public.fpagg_tab_p3
> +   Output: fpagg_tab_p3.b, fpagg_tab_p3.a
> +   Remote SQL: SELECT a, b FROM public.pagg_tab_p3
> +(26 rows)
>
> I think we are interested in the overall shape of the plan and not the details of
> Remote SQL etc. So, may be turn off VERBOSE. This comment applies to an
> earlier
> plan with enable_partition_wise_agg = false;
>

OK. Removed VERBOSE from all the queries as we are interested in the overall
shape of the plan.


>
> +
> +SELECT b, avg(a), max(a), count(*) FROM pagg_tab GROUP BY b HAVING
> sum(a) < 800 ORDER BY 1;
> + b  | avg | max | count
> ++-+-+---
> +  0 | 10. |  20 |60
> +  1 | 11. |  21 |60
> [... clipped ...]
> + 42 | 12. |  22 |60
> + 43 | 13. |  23 |60
> +(20 rows)
>
> Since the aggregates were not pushed down, I doubt we should be testing the
> output. But this test is good to check partial aggregates over foreign
> partition scans, which we don't have in postgres_fdw.sql I think. So, may
> be
> add it as a separate patch?
>

Agreed. Removed the SELECT query; EXPLAIN is enough here.


>
> Can you please add a test where we reference a whole-row; that usually has
> troubles.
>

Added.
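
The added case is roughly of this shape (a sketch; the actual test may
differ), using a whole-row reference inside an aggregate:

EXPLAIN (COSTS OFF)
SELECT a, count(t) FROM pagg_tab t GROUP BY a ORDER BY 1;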


>
> -if (root->hasHavingQual && 

Re: [HACKERS] Partition-wise aggregation/grouping

2018-01-01 Thread Jeevan Chalke
Attached patchset with all the review comments reported so far.

On Fri, Dec 1, 2017 at 6:11 PM, Ashutosh Bapat <
ashutosh.ba...@enterprisedb.com> wrote:

> Continuing with review of 0007.
>
> +
> +/* Copy input rels's relids to grouped rel */
> +grouped_rel->relids = input_rel->relids;
>
> Isn't this done in fetch_upper_rel()? Why do we need it here?
> There's also a similar hunk in create_grouping_paths() which doesn't look
> appropriate. I guess, you need relids in grouped_rel->relids for FDW.
> There are
> two ways to do this: 1. set grouped_rel->relids for parent upper rel as
> well,
> but then we should pass relids to fetch_upper_rel() instead of setting
> those
> later. 2. For a parent upper rel, in create_foreignscan_plan(), set relids
> to
> all_baserels, if upper_rel->relids is NULL and don't set relids for a
> parent
> upper rel. I am fine with either of those.
>

Done. Opted for the second option.


>
> +/* partial phase */
> +get_agg_clause_costs(root, (Node *) partial_target->exprs,
> + AGGSPLIT_INITIAL_SERIAL,
> + _partial_costs);
>
> IIUC, the costs for evaluating aggregates would not change per child. They
> won't be different for parent and any of the children. So, we should be
> able to
> calculate those once, save in "extra" and use repeatedly.
>

Yep. Done.


>
> +if (can_sort)
> +{
> +Path   *path = cheapest_path;
> +
> +if (!(pathkeys_contained_in(root->group_pathkeys,
> +path->pathkeys)))
> [ .. clipped patch .. ]
> +   NIL,
> +   dNumGroups));
> +}
>
> We create two kinds of paths: partial paths for parallel query and partial
> aggregation paths when group keys do not have partition keys. The comments
> and
> code uses partial to mean both the things, which is rather confusing. May
> be we
> should use term "partial aggregation" explicitly wherever it means that in
> comments and in variable names.
>

Agree. Used "partial aggregation" wherever applicable. Let me know if you
see any other place need this adjustments.


> I still feel that create_grouping_paths() and create_child_grouping_paths()
> have a lot of common code. While I can see that there are some pockets in
> create_grouping_paths() which are not required in
> create_child_grouping_paths()
> and vice-versa, may be we should create only one function and move those
> pockets under "if (child)" or "if (parent)" kind of conditions. It will be
> a
> maintenance burden to keep these two functions in sync in future. If we do
> not
> keep them in sync, that will introduce bugs.
>

Agreed that keeping these two functions in sync in the future will be a
maintenance burden, but I am not yet sure how to refactor them cleanly.
I will give it one more try and include those changes in the next patchset.


> +
> +/*
> + * create_partition_agg_paths
> + *
> + * Creates append path for all the partitions and adds into the grouped
> rel.
>
> I think you want to write "Creates an append path containing paths from
> all the
> child grouped rels and adds into the given parent grouped rel".
>

Reworded as you said.


>
> + * For partial mode we need to add a finalize agg node over append path
> before
> + * adding a path to grouped rel.
> + */
> +void
> +create_partition_agg_paths(PlannerInfo *root,
> +   RelOptInfo *grouped_rel,
> +   List *subpaths,
> +   List *partial_subpaths,
>
> Why do we have these two as separate arguments? I don't see any call to
> create_partition_agg_paths() through add_append_path() passing both of them
> non-NULL simultaneously. May be you want use a single list subpaths and
> another
> boolean indicating whether it's list of partial paths or regular paths.
>
>
After redesigning the area of putting Gather over Append, I don't need
to pass all the Append subpaths to this function at all. The Append is built by
add_paths_to_append_rel() itself; this function now just adds the finalization
steps as needed.
So we don't have two lists now, and to indicate partial paths, a boolean is
passed instead. Please have a look and let me know if I missed anything.


> +
> +/* For non-partial path, just create a append path and we are done. */
> This is the kind of confusion, I am talking about above. Here you have
> mentioned "non-partial path" which may mean a regular path but what you
> actually mean by that term is a path representing partial aggregates.
>

> +/*
> + * Partial partition-wise grouping paths.  Need to create final
> + * aggregation path over append relation.
> + *
> + * If there are partial subpaths, then we need to add gather path
> before we
> + * append these subpaths.
>
> More confusion here.
>

Hopefully no more confusion in this new version.


> + */
> +if 

Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

2018-01-01 Thread Mithun Cy
On Tue, Dec 19, 2017 at 5:52 AM, Masahiko Sawada  wrote:
> On Mon, Dec 18, 2017 at 2:04 PM, Masahiko Sawada  
> wrote:
>> On Sun, Dec 17, 2017 at 12:27 PM, Robert Haas  wrote:
>>>
>>> I have to admit that result is surprising to me.
>>
>> I think the environment I used for performance measurement did not
>> have enough resources. I will do the same benchmark on an another
>> environment to see if it was a valid result, and will share it.
>>
> I did performance measurement on a different environment which has 4
> cores and two physically separated disk volumes. Also I've changed the
> benchmarking so that COPYs load only 300 integer tuples, which do not
> fit within a single page, and changed tables to unlogged tables to
> observe the overhead of locking/unlocking relext locks.

I ran the same test as asked by Robert; it was just an extension of the tests
[1] pointed out by Amit Kapila.

Machine : cthulhu

Architecture:  x86_64
CPU op-mode(s):32-bit, 64-bit
Byte Order:Little Endian
CPU(s):128
On-line CPU(s) list:   0-127
Thread(s) per core:2
Core(s) per socket:8
Socket(s): 8
NUMA node(s):  8
Vendor ID: GenuineIntel
CPU family:6
Model: 47
Model name:Intel(R) Xeon(R) CPU E7- 8830  @ 2.13GHz
Stepping:  2
CPU MHz:   1064.000
CPU max MHz:   2129.
CPU min MHz:   1064.
BogoMIPS:  4266.59
Virtualization:VT-x
Hypervisor vendor: vertical
Virtualization type:   full
L1d cache: 32K
L1i cache: 32K
L2 cache:  256K
L3 cache:  24576K
NUMA node0 CPU(s): 0-7,64-71
NUMA node1 CPU(s): 8-15,72-79
NUMA node2 CPU(s): 16-23,80-87
NUMA node3 CPU(s): 24-31,88-95
NUMA node4 CPU(s): 32-39,96-103
NUMA node5 CPU(s): 40-47,104-111
NUMA node6 CPU(s): 48-55,112-119
NUMA node7 CPU(s): 56-63,120-127

It has 2 disks with different filesystems as below:
/dev/mapper/vg_mag-data2ext4  5.1T  3.6T  1.2T  76% /mnt/data-mag2
/dev/mapper/vg_mag-data1xfs   5.1T  1.6T  3.6T  31% /mnt/data-mag

I have created 2 tables, one on each of the above filesystems.
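
The DDL isn't included here; a minimal way to place one table per volume, with
hypothetical table and tablespace names, is via a tablespace on the second
mount, e.g.:

CREATE TABLESPACE data_mag2 LOCATION '/mnt/data-mag2';
CREATE TABLE copy_tab1 (a int, b int, c int);                       -- on the data directory's volume
CREATE TABLE copy_tab2 (a int, b int, c int) TABLESPACE data_mag2;  -- on the second volume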

test_size_copy.sh --> automated script to run the copy test.
copy_script1, copy_script2 --> copy pgbench scripts used by
test_size_copy.sh to load into 2 different tables.

To run the above copy scripts in parallel, I ran them with equal weights as below.
./pgbench -c $threads -j $threads -f copy_script1@1 -f copy_script2@1
-T 120 postgres >> test_results.txt


Results :
---

Clients    HEAD-TPS
-------    ----------
1           84.460734
2          121.359035
4          175.886335
8          268.764828
16         369.996667
32         439.032756
64         482.185392


Clients    N_RELEXTLOCK_ENTS = 1024    %diff with HEAD
-------    ------------------------    ---------------
1           87.165777                   3.20272258112273
2          131.094037                   8.02165409439848
4          181.667104                   3.2866504381935
8          267.412856                  -0.503031594595423
16         376.118671                   1.65461058058666
32         460.756357                   4.94805927419228
64         492.723975                   2.18558736428913

Not much of an improvement from HEAD

Clients    N_RELEXTLOCK_ENTS = 1    %diff with HEAD
-------    ---------------------    ---------------
1           86.288574                 2.16412990206786
2          131.398667                 8.27266960387414
4          168.681079                -4.09654109854526
8          245.841999                -8.52895416806549
16         321.972147               -12.9797169226933
32         375.783299               -14.4065462395703
64         360.134531               -25.3120196142317


So in the case of N_RELEXTLOCK_ENTS = 1 we can see a regression as high as 25%.


[1]https://www.postgresql.org/message-id/CAFiTN-tkX6gs-jL8VrPxg6OG9VUAKnObUq7r7pWQqASzdF5OwA%40mail.gmail.com
-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


test_size_copy.sh
Description: Bourne shell script


copy_script1
Description: Binary data


copy_script2
Description: Binary data


Re: [HACKERS] UPDATE of partition key

2018-01-01 Thread David Rowley
On 23 December 2017 at 04:00, Amit Khandekar  wrote:
> On 15 December 2017 at 18:28, Robert Haas  wrote:
>> -PartitionDispatch **pd,
>> -ResultRelInfo ***partitions,
>> -TupleConversionMap ***tup_conv_maps,
>> -TupleTableSlot **partition_tuple_slot,
>> -int *num_parted, int *num_partitions)
>> +PartitionTupleRouting **partition_tuple_routing)
>>
>> Since we're consolidating all of ExecSetupPartitionTupleRouting's
>> output parameters into a single structure, I think it might make more
>> sense to have it just return that value.  I think it's only done with
>> output parameter today because there are so many different things
>> being produced, and we can't return them all.
>
> You mean ExecSetupPartitionTupleRouting() will return the structure
> (not pointer to structure), and the caller will get the copy of the
> structure like this ? :
>
> mtstate->mt_partition_tuple_routing =
> ExecSetupPartitionTupleRouting(mtstate, rel, node->nominalRelation, estate);
>
> I am ok with that, but just wanted to confirm if that is what you are
> saying. I don't recall seeing a structure return value in PG code, so
> not sure if it is conventional in PG to do that. Hence, I am somewhat
> inclined to keep it as output param. It also avoids a structure copy.
>
> Another way is for ExecSetupPartitionTupleRouting() to palloc this
> structure, and return its pointer, but then caller would have to
> anyway do a structure copy, so that's not convenient, and I don't
> think you are suggesting this way either.

I'm pretty sure Robert is suggesting that
ExecSetupPartitionTupleRouting pallocs the memory for the structure,
sets it up then returns a pointer to the new struct. That's not very
unusual. It seems unusual for a function to return void and modify a
single parameter pointer to get the value to the caller rather than
just to return that value.


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Unimpressed with pg_attribute_always_inline

2018-01-01 Thread Thomas Munro
On Tue, Jan 2, 2018 at 4:17 PM, Tom Lane  wrote:
>  baiji | .\src\backend\executor\nodeHashjoin.c(165): warning C4141: 
> 'inline' : used more than once
>  bowerbird | src/backend/executor/nodeHashjoin.c(165): warning C4141: 
> 'inline' : used more than once 
> [c:\prog\bf\root\HEAD\pgsql.build\postgres.vcxproj]
>  currawong | .\src\backend\executor\nodeHashjoin.c(165): warning C4141: 
> 'inline' : used more than once
>  gaur  | nodeHashjoin.c:167: warning: `always_inline' attribute directive 
> ignored
>  mastodon  | .\src\backend\executor\nodeHashjoin.c(165): warning C4141: 
> 'inline' : used more than once
>  thrips| src/backend/executor/nodeHashjoin.c(165): warning C4141: 
> 'inline' : used more than once 
> [C:\buildfarm\buildenv\HEAD\pgsql.build\postgres.vcxproj]
>  woodlouse | src/backend/executor/nodeHashjoin.c(165): warning C4141: 
> 'inline' : used more than once 
> [C:\buildfarm\buildenv\HEAD\pgsql.build\postgres.vcxproj]

So that's two compilers:

1. MSVC doesn't like you to say both "__forceinline" and "inline".

2.  GCC 2.95.3 doesn't understand always_inline.  From a quick look at
archived manuals, it seems that that attribute arrived in 3.1.

It may be that "inline" can be removed (that seems to work OK for me
on clang, but I didn't check GCC).  Not sure off-hand how best to
tackle the ancient GCC problem; maybe a configure test or maybe a GCC
version test.  I will look into those problems.

> In the second place, what I read in gcc's manual about the meaning of
> the always_inline directive is
>
> `always_inline'
>  Generally, functions are not inlined unless optimization is
>  specified.  For functions declared inline, this attribute inlines
>  the function even if no optimization level was specified.
>
> I entirely reject the notion that we should be worried about optimizing
> performance in -O0 builds.  In fact, if someone is building with -O0,
> it's likely the case that they are hoping for exact correspondence
> of source lines to object code, and thus forcing inline is defeating
> their purpose.  I've certainly found plenty of times that inlining
> makes it harder to follow things in a debugger.
>
> Therefore, I think that pg_attribute_always_inline is not merely
> useless but actively bad, and should be removed.

My intention was to make sure it really did get inlined at higher
optimisation levels even though the compiler wouldn't otherwise choose
to do that in a couple of special cases, not to force inlining even at
-O0.  Not sure how to achieve the first of those things without the
second.  I wonder if there is a better way.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Race to build pg_isolation_regress in "make -j check-world"

2018-01-01 Thread Noah Misch
On Sat, Dec 16, 2017 at 08:06:05PM -0800, Noah Misch wrote:
> On Mon, Nov 06, 2017 at 12:07:52AM -0800, Noah Misch wrote:

> I now see similar trouble from multiple
> "make" processes running "make -C contrib/test_decoding install" concurrently.
> This is a risk for any directory named in an EXTRA_INSTALL variable of more
> than one makefile.  Under the right circumstances, this would affect
> contrib/hstore and others in addition to contrib/test_decoding.  That brings
> me back to the locking idea:
> 
> > The problem of multiple "make" processes in a directory (especially 
> > src/port)
> > shows up elsewhere.  In a cleaned tree, "make -j -C src/bin" or "make -j
> > installcheck-world" will do it.  For more-prominent use cases, src/Makefile
> > prevents this with ".NOTPARALLEL:" and building first the directories that 
> > are
> > frequent submake targets.  Perhaps we could fix the general problem with
> > directory locking; targets that call "$(MAKE) -C FOO" would first sleep 
> > until
> > FOO's lock is available.  That could be tricky to make robust.
> 
> If one is willing to assume that a lock-holding process never crashes, locking
> in a shell script is simple: mkdir to lock, rmdir to unlock.  I don't want to
> assume that.  The bakery algorithm provides convenient opportunities for
> checking whether the last locker crashed; I have attached a shell script
> demonstrating this approach.  Better ideas?  Otherwise, I'll look into
> integrating this design into the makefiles.

Performance has been the principal snare.  I value "make -j" being fast when
there's little to rebuild, but that shell script approach slowed an empty
build by 340% (GNU/Linux w/ SSD) to 2300% (Cygwin).  In a build having nothing
to do, merely adding a no-op wrapper around "make -C" (does nothing but
execvp() the real GNU make) slowed the build by over 10%[1].  To get what I
considered decent performance took several design changes:

1. Replace the shell script implementation of the bakery algorithm with a C
   program that issues fcntl(F_SETLK) on the target directory's makefile.

2. Stop re-entering widely-referenced directories like src/common and src/port
   dozens of times per build.  Instead, enter them before any "make -C", then
   assume they're built if $(MAKELEVEL) is nonzero.  This reduced the total
   number of "make -C" calls and lock acquisitions from 276 to 147.

3. Lock only directories known to be entered more than once per top-level
   target.  Preliminarily, this reduced the 147 lock acquisitions to 25.

I regret that (3) entails ongoing maintenance of a whitelist of such
directories; adding a "make -C" call can expand the list.  However, I can
automate verification of that whitelist.  If I abandon (3), an empty build in
an NFS-mounted source directory slows from 6.4s to 22s; with all three design
changes intact, it slows to 6.6s.  I think that justifies keeping (3).

I considered abandoning (1).  An empty build would then slow from 6.6s to 7.7s
on NFS-mounted sources, and it would slow from 2.1s to 6.8s on Cygwin.  (I
expect similar on MSYS.)  That is perhaps acceptable.  It would save a few
lines of code and perhaps avoid portability snags.  Unlike abandoning (3), I
would not expect long-term maintenance savings.  For systems with low PID
entropy[2], (1) also eliminates unreliable code for detecting that a lock
holder has died.  I plan to keep all three design changes.


The non-performance problems that arose were easy to fix:

a. In addition to avoiding overlapping "make install" invocations in each
   directory, we must ensure no directory is installed more than once per
   tmp_install.  Otherwise, the second "make install" could unlink or
   overwrite a file while another process runs a test that uses the file.  I
   fixed this with stamp files like we use elsewhere in Makefile.global.

b. Our "make -C" call graph has cycles[3].  To avoid deadlocks, I accumulated
   lock-holding PIDs in an environment variable.  If the holder of a desired
   lock appears in that variable, then that holder is sleeping until after the
   present process completes.  We can then proceed without a lock acquisition.

Comments?  Otherwise, I'll look at finishing the patch series.

Thanks,
nm


[1] I was surprised to see that GNU make is this efficient at determining
there's nothing to rebuild.

[2] For example, Cygwin reuses PIDs as often as every 645 processes.  Also,
Cygwin processes have both a Cygwin PID and a Windows PID, and kill() can
match either PID.  The shell script relied on kill(PID, 0) to deduce that
a lock holder had died.  We may have ended up needing a secondary factor,
like process start time, on such systems.

[3] For example, src/backend depends on libpgport_srv.a, and src/port depends
on submake-errcodes of src/backend.



Re: What does Time.MAX_VALUE actually represent?

2018-01-01 Thread Gavin Flower

On 01/02/2018 01:26 AM, Tels wrote:

Moin,

On Sat, December 30, 2017 4:25 pm, Gavin Flower wrote:

On 12/31/2017 03:07 AM, Dave Cramer wrote:

We are having a discussion on the jdbc project about dealing with
24:00:00.

https://github.com/pgjdbc/pgjdbc/pull/992#issuecomment-354507612

Dave Cramer

In Dublin (I was there 2001 to 2004), Time tables show buses just after
midnight, such as 1:20am as running at the time 2520 - so they are
visible close to the end of the day.  If you are looking for buses
around midnight this is very user friendly - better than looking at the
other end of the time table for 0120.

I think logically that 24:00:00 is exactly one day later than 00:00:00 -
but I see from following the URL, that there are other complications...

Careful here, if "24:00:00" always means literally "00:00:00 one day
later", that could work, but you can't just have it meaning "add 24 hours
to the clock".

For instance, during daylight saving time changes, days can be 23 hours or
25 hours long...

Best wishes,

Tels


Agreed, I'm thinking purely of displayed time, where the utility of 
using times like 2400 and 2530 is purely the convenience of people 
looking to catch a bus after a late night out.


The 24:00 time should be regarded in a similar way to the notation of 
'0+' in limits (which is informally described as the smallest positive real 
number -- formally that is nonsense; 'lim 0+' actually means approaching 
the limit from the positive direction).



Cheers,
Gavin




Re: [Patch] Checksums for SLRU files

2018-01-01 Thread Andrey Borodin
Hi, Ivan!
> 31 дек. 2017 г., в 22:30, Ivan Kartyshov  
> написал(а):
> 
> Hello, I`d like to show my implementation of SLRU file protection with 
> checksums.
> .
> I would like to hear your thoughts over my patch.

As far as I can see, the patch solves the problem of hardware corruption in SLRU.
This seems like a valid concern. I've tried to understand your patch and a few 
questions arose which I could not answer myself.

1. Why do you propose a different GUC besides ignore_checksum_failure? What is 
the scenario for its use which is not covered by the general GUC switch?
2. What is the performance penalty of having these checksums?

Besides this, some things seem suspicious to me:
1. This comment seems excessive. I'd leave just the first line.
+/*
+ * GUC variable
+ * Set from backend:
+ * alter system set ignore_slru_checksum_failure = on/off;
+ * select pg_reload_conf();
+ */
2. Functions pg_checksum_slru_page() and pg_getchecksum_slru_page() operate 
with two bytes instead of uint16. This seems strange.
3. This line
checksum = (pg_checksum_block(page, BLCKSZ) % 65535) + 1;
This needs to share a comment with the previous function (pg_checksum_page()). The +1 was a 
tough thing for me to understand before looking around and reading those 
comments.
4. I could not understand the purpose of this expression
page[BLCKSZ - 1] & 0X00FF

Happy New Year :) Keep up good work.

Best regards, Andrey Borodin.


Re: Increasing timeout of poll_query_until for TAP tests

2018-01-01 Thread Noah Misch
On Mon, Jan 01, 2018 at 07:55:37PM +0900, Michael Paquier wrote:
> On Sun, Dec 31, 2017 at 09:52:27PM -0800, Noah Misch wrote:
> > Since now() is transaction_timestamp(), $recovery_time precedes or equals
> > $lsn3, and this didn't close the race.  Using clock_timestamp() here would
> > work, as does using separate transactions like recovery-test-fixes.patch 
> > did.
> > I'll shortly push a fix for this and a similar ordering problem in the
> > standby_5 test, which first appeared subsequent to this thread.
> 
> As recovery_target_inclusive is true by default, my conclusion on the
> matter, which was something that my tests on hamster, the now-dead
> buildfarm animal seemed to confirm, is that just getting a timestamp at
> least the value of the LSN from the same transaction was enough to fix
> all the failures. And hamster was really slow. I can follow why
> logically your patch makes sense, so I agree that this is sane. Have you
> spotted failures from the buildfarm?

No, but I checked only the last 90 days.  Earlier master (e.g. git checkout
6078770^) with the following patch reproduces the failures on every run:

--- a/src/test/recovery/t/003_recovery_targets.pl
+++ b/src/test/recovery/t/003_recovery_targets.pl
@@ -71,8 +71,8 @@ my ($lsn2, $recovery_txid) = split /\|/, $ret;
 $node_master->safe_psql('postgres',
"INSERT INTO tab_int VALUES (generate_series(2001,3000))");
 $ret =
-  $node_master->safe_psql('postgres', "SELECT pg_current_wal_lsn(), now();");
-my ($lsn3, $recovery_time) = split /\|/, $ret;
+  $node_master->safe_psql('postgres', "SELECT pg_sleep(80), 
pg_current_wal_lsn(), now();");
+my ($delay_for_autovacuum, $lsn3, $recovery_time) = split /\|/, $ret;
 
 # Even more data, this time with a recovery target name
 $node_master->safe_psql('postgres',
@@ -88,6 +88,7 @@ $node_master->safe_psql('postgres',
"INSERT INTO tab_int VALUES (generate_series(4001,5000))");
 my $recovery_lsn =
   $node_master->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_master->safe_psql('postgres', 'VACUUM');  # write some WAL
 my $lsn5 =
   $node_master->safe_psql('postgres', "SELECT pg_current_wal_lsn();");
 



Re: [HACKERS] UPDATE of partition key

2018-01-01 Thread Amit Khandekar
On 16 December 2017 at 03:09, Robert Haas  wrote:
> started another review pass over the main patch, so here are
> some comments about that.

I have yet to address all the comments, but meanwhile, below are some
specific points ...

> + if (!partrel)
> + {
> + /*
> + * We locked all the partitions above including the leaf
> + * partitions. Note that each of the newly opened relations in
> + * *partitions are eventually closed by the caller.
> + */
> + partrel = heap_open(leaf_oid, NoLock);
> + InitResultRelInfo(leaf_part_rri,
> +   partrel,
> +   resultRTindex,
> +   rel,
> +   estate->es_instrument);
> + }
>
> Hmm, isn't there a problem here?  Before, we opened all the relations
> here and the caller closed them all.  But now, we're only opening some
> of them.  If the caller closes them all, then they will be closing
> some that we opened and some that we didn't.  That seems quite bad,
> because the reference counts that are incremented and decremented by
> opening and closing should all end up at 0.  Maybe I'm confused
> because it seems like this would break in any scenario where even 1
> relation was already opened and surely you must have tested that
> case... but if there's some reason this works, I don't know what it
> is, and the comment doesn't tell me.

In ExecCleanupTupleRouting(), we are closing only those newly opened
partitions. We skip those which are actually part of the update result
rels.

> + /*
> + * UPDATEs set the transition capture map only when a new subplan
> + * is chosen.  But for INSERTs, it is set for each row. So after
> + * INSERT, we need to revert back to the map created for UPDATE;
> + * otherwise the next UPDATE will incorrectly use the one created
> + * for INESRT.  So first save the one created for UPDATE.
> + */
> + if (mtstate->mt_transition_capture)
> + saved_tcs_map = mtstate->mt_transition_capture->tcs_map;
>
> I wonder if there is some more elegant way to handle this problem.
> Basically, the issue is that ExecInsert() is stomping on
> mtstate->mt_transition_capture, and your solution is to save and
> restore the value you want to have there.  But maybe we could instead
> find a way to get ExecInsert() not to stomp on that state in the first
> place.  It seems like the ON CONFLICT stuff handled that by adding a
> second TransitionCaptureState pointer to ModifyTable, thus
> mt_transition_capture and mt_oc_transition_capture.  By that
> precedent, we could add mt_utr_transition_capture or similar, and
> maybe that's the way to go.  It seems a bit unsatisfying, but so does
> what you have now.

In case of ON CONFLICT, if there are both INSERT and UPDATE statement
triggers referencing transition tables, both of the triggers need to
independently populate their own transition tables, and hence the need
for two separate transition states : mt_transition_capture and
mt_oc_transition_capture. But in case of update-tuple-routing, the
INSERT statement trigger won't come into picture. So the same
mt_transition_capture can serve the purpose of populating the
transition table with OLD and NEW rows. So I think it would be too
redundant, if not incorrect, to have a whole new transition state for
update tuple routing.

I will see if it turns out better to have two tcs_maps in
TransitionCaptureState, one for update and one for insert. But this,
on first look, does not look good.

> + * If per-leaf map is required and the map is already created, that map
> + * has to be per-leaf. If that map is per-subplan, we won't be able to
> + * access the maps leaf-partition-wise. But if the map is per-leaf, we
> + * will be able to access the maps subplan-wise using the
> + * subplan_partition_offsets map using function
> + * tupconv_map_for_subplan().  So if the callers might need to access
> + * the map both leaf-partition-wise and subplan-wise, they should make
> + * sure that the first time this function is called, it should be
> + * called with perleaf=true so that the map created is per-leaf, not
> + * per-subplan.
>
> This sounds complicated and fragile.  It ends up meaning that
> mt_childparent_tupconv_maps is sometimes indexed by subplan number and
> sometimes by partition leaf index, which is extremely confusing and
> likely to lead to coding errors, either in this patch or in future
> ones.

Even if we always index the map by leaf partition, while accessing the
map the code still needs to be aware of whether the index number with
which we are accessing the map is the subplan number or leaf partition
number:

If the access is by subplan number, use subplan_partition_offsets to
convert to the leaf partition index. So the function
tupconv_map_for_subplan() is anyway necessary for accessing by
subplan index. The only thing that will change is that
tupconv_map_for_subplan() will not have to check whether the map is
indexed by leaf partition or not. But that complexity is hidden in
this function alone; the outside code need not worry about it.

If the access is by 

Re: ALTER TABLE ADD COLUMN fast default

2018-01-01 Thread Andrew Dunstan


On 12/31/2017 06:48 PM, Andrew Dunstan wrote:
>
> Here is a new version of the patch. It separates the missing values
> constructs and logic from the default values constructs and logic. The
> missing values now sit alongside the default values in tupleConstr. In
> some places the separation makes the logic a good bit cleaner.
>
> Some of the logic is also reworked to remove an assumption that the
> missing values structure is populated in attnum order. Also, code to
> support the missing structures is added to equalTupleDescs.
>
> We still have that one extra place where we need to call
> CreateTupleDescCopyConstr instead of CreateTupleDescCopy. I haven't seen
> an easy way around that.
>


New version of the patch that fills in the remaining piece in
equalTupleDescs.
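
To recap at the SQL level what the patch enables (a sketch with a made-up
table, per the behaviour described in the documentation hunk below):

CREATE TABLE t (a int);
INSERT INTO t SELECT generate_series(1, 1000000);
ALTER TABLE t ADD COLUMN b int DEFAULT 42;  -- non-volatile default: metadata-only, no rewrite
SELECT b FROM t WHERE a = 1;                -- 42 is supplied from attmissingval on readout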

cheers

andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 3f02202..d5dc14a 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -1150,6 +1150,18 @@
  
 
  
+  atthasmissing
+  bool
+  
+  
+This column has a value which is used where the column is entirely
+missing from the row, as happens when a column is added after the
+row is created. The actual value used is stored in the
+attmissingval  column.
+  
+ 
+
+ 
   attidentity
   char
   
@@ -1229,6 +1241,18 @@
   
  
 
+ 
+  attmissingval
+  bytea
+  
+  
+This column has a binary representation of the value used when the column
+is entirely missing from the row, as happens when the column is added after
+the row is created. The value is only used when
+atthasmissing is true.
+  
+ 
+
 

   
diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml
index 7bcf242..780bead 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -1115,26 +1115,28 @@ ALTER TABLE [ IF EXISTS ] name

 

-When a column is added with ADD COLUMN, all existing
-rows in the table are initialized with the column's default value
-(NULL if no DEFAULT clause is specified).
-If there is no DEFAULT clause, this is merely a metadata
-change and does not require any immediate update of the table's data;
-the added NULL values are supplied on readout, instead.
+ When a column is added with ADD COLUMN and a
+ non-volatile DEFAULT is specified, the default
+ is evaluated at the time of the statement and the result stored in the
+ table's metadata. That value will be used for the column for
+ all existing rows. If no DEFAULT is specified
+ NULL is used. In neither case is a rewrite of the table required.

 

-Adding a column with a DEFAULT clause or changing the type of
-an existing column will require the entire table and its indexes to be
-rewritten.  As an exception when changing the type of an existing column,
-if the USING clause does not change the column
-contents and the old type is either binary coercible to the new type or
-an unconstrained domain over the new type, a table rewrite is not needed;
-but any indexes on the affected columns must still be rebuilt.  Adding or
-removing a system oid column also requires rewriting the entire
-table.  Table and/or index rebuilds may take a significant amount of time
-for a large table; and will temporarily require as much as double the disk
-space.
+ Adding a column with a volatile DEFAULT or
+ changing the type of
+ an existing column will require the entire table and its indexes to be
+ rewritten.  As an exception when changing the type of an existing column,
+ if the USING clause does not change the column
+ contents and the old type is either binary coercible to the new type or
+ an unconstrained domain over the new type, a table rewrite is not needed;
+ but any indexes on the affected columns must still be rebuilt.  Adding or
+ removing a system oid column also requires rewriting
+ the entire table.
+ Table and/or index rebuilds may take a significant amount of time
+ for a large table; and will temporarily require as much as double the disk
+ space.

 

diff --git a/src/backend/access/common/heaptuple.c b/src/backend/access/common/heaptuple.c
index a1a9d99..6aab16f 100644
--- a/src/backend/access/common/heaptuple.c
+++ b/src/backend/access/common/heaptuple.c
@@ -76,6 +76,116 @@
  * 
  */
 
+/*
+ * Return the missing value of an attribute, or NULL if it
+ * there is no missing value.
+ */
+static Datum
+getmissingattr(TupleDesc tupleDesc,
+			   int attnum, bool *isnull)
+{
+	int			missingnum;
+	Form_pg_attribute att;
+	AttrMissing *attrmiss;
+
+	

Re: [HACKERS] Runtime Partition Pruning

2018-01-01 Thread David Rowley
On 1 January 2018 at 19:22, Beena Emerson  wrote:
> I think you are testing without asserts

Yeah, I was indeed. Oops.

> The following assert fails: src/backend/optimizer/plan/setrefs.c :
> set_plan_refs: ln 921
> Assert(splan->plan.qual == NIL);
> Append node now has runtime partition quals.
>
> Also since the valid subplans are set in ExecInitAppend, the queries
> with Init Plans do not work. I had moved it to ExecAppend in my patch
> to handle the InitPlans as well.

Thanks for noticing. I've now changed things around so this case works
as it should and I've added a test too.

I've attached an updated patch which also fixes a number of other
problems with my previous patch.

1. The Bitmapset I was using in nodeAppend.c to mark the valid
subplans was pretty bogus for Parallel Append since the memory for the
set was not in shared memory. I changed things around to reuse the
pa_finished[] array and the patch just now sets pa_finished to true
for any invalid subplans.
2. I've added a new memory context to use in nodeAppend.c which is
used to call the planner code to determine which partitions are valid.
I'd been trying to have Amit be careful to pfree() everything in his
v17 patch, but I realised it was just not possible to get everything
pfree'd. I found it pretty easy to construct a test case which caused
an OOM.
3. I've added support for IN lists to be pruned when the IN() list
contains a parameter. The changes I made to support this case probably
mostly belong in Amit's faster partition pruning patch, but I've put
them here for now to get this case working. There's a bunch of new
tests to test this (a sketch of the kind of query this covers follows
after this list).
4. Various other cosmetic improvements.
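
As an illustration of point 3 (hypothetical tables and query; the actual
regression tests differ):

CREATE TABLE parted (a int) PARTITION BY LIST (a);
CREATE TABLE parted_p1 PARTITION OF parted FOR VALUES IN (1);
CREATE TABLE parted_p2 PARTITION OF parted FOR VALUES IN (2);
PREPARE q(int) AS SELECT * FROM parted WHERE a IN (1, $1);
EXPLAIN (COSTS OFF) EXECUTE q(1);  -- run-time pruning can skip parted_p2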

The attached patch should be applied after patching master with Amit's
v17 faster partition pruning patch [1].

[1] 
https://www.postgresql.org/message-id/58c3e20a-a964-4fdb-4e7d-bd833e9be...@lab.ntt.co.jp

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


runtime_prune_drowley_v5.patch
Description: Binary data


Re: What does Time.MAX_VALUE actually represent?

2018-01-01 Thread Tels
Moin,

On Sun, December 31, 2017 12:50 pm, Tom Lane wrote:
> Peter Eisentraut  writes:
>> select timestamp '2017-12-30 24:00:00';
>> returns
>> 2017-12-31 00:00:00
>> which makes some sense.
>
>> I don't know why we accept that and not '24:00:01' and beyond, but it's
>> probably historical.
>
> We also accept
>
> regression=# select timestamp '2017-12-30 23:59:60';
>   timestamp
> -
>  2017-12-31 00:00:00
> (1 row)
>
> which is maybe a little bit more defensible as a common behavior
> to allow for leap seconds.  (It's far from a correct implementation of
> leap seconds, of course, but I think most versions of mktime() and
> friends do likewise.)
>
> Digging into the git history, it looks like this choice dates to
>
> commit a93bf4503ffc6d7cd6243a6324fb2ef206b10adf
> Author: Bruce Momjian 
> Date:   Fri Oct 14 11:47:57 2005 +
>
> Allow times of 24:00:00 to match rounding behavior:
>
> regression=# select '23:59:59.9'::time(0);
>time
> --
>  24:00:00
> (1 row)
>
> This is bad because:
>
> regression=# select '24:00:00'::time(0);
> ERROR:  date/time field value out of range: "24:00:00"
>
> The last example now works.
>
>
> There may at the time have been some argument about imprecision of
> float timestamps involved, but it works the same way today once
> you exceed the precision of our integer timestamps:
>
> regression=# select time '23:59:59.99';
>   time
> -
>  23:59:59.99
> (1 row)
>
> regression=# select time '23:59:59.999';
>time
> --
>  24:00:00
> (1 row)
>
> If we didn't allow '24:00:00' as a valid value then we'd need to
> throw an error for '23:59:59.999', which doesn't seem nice.

Hm, but shouldn't the result then be "00:00:00" instead of "24:00:00"?

With addition it seems to work different:

postgres=# select time '23:59:59.99' + interval '0.01 seconds';
 ?column?
--
 00:00:00
(1 row)

Best regards,

Tels



Re: What does Time.MAX_VALUE actually represent?

2018-01-01 Thread Tels
Moin,

On Sat, December 30, 2017 4:25 pm, Gavin Flower wrote:
> On 12/31/2017 03:07 AM, Dave Cramer wrote:
>> We are having a discussion on the jdbc project about dealing with
>> 24:00:00.
>>
>> https://github.com/pgjdbc/pgjdbc/pull/992#issuecomment-354507612
>>
>> Dave Cramer
>
> In Dublin (I was there 2001 to 2004), Time tables show buses just after
> midnight, such as 1:20am as running at the time 2520 - so they are
> visible close to the end of the day.  If you are looking for buses
> around midnight this is very user friendly - better than looking at the
> other end of the time table for 0120.
>
> I think logically that 24:00:00 is exactly one day later than 00:00:00 -
> but I see from following the URL, that there are other complications...

Careful here, if "24:00:00" always means literally "00:00:00 one day
later", that could work, but you can't just have it meaning "add 24 hours
to the clock".

For instance, during daylight saving time changes, days can be 23 hours or
25 hours long...

Best wishes,

Tels



Re: Increasing timeout of poll_query_until for TAP tests

2018-01-01 Thread Michael Paquier
On Sun, Dec 31, 2017 at 09:52:27PM -0800, Noah Misch wrote:
> Since now() is transaction_timestamp(), $recovery_time precedes or equals
> $lsn3, and this didn't close the race.  Using clock_timestamp() here would
> work, as does using separate transactions like recovery-test-fixes.patch did.
> I'll shortly push a fix for this and a similar ordering problem in the
> standby_5 test, which first appeared subsequent to this thread.

As recovery_target_inclusive is true by default, my conclusion on the
matter, which was something that my tests on hamster, the now-dead
buildfarm animal seemed to confirm, is that just getting a timestamp at
least the value of the LSN from the same transaction was enough to fix
all the failures. And hamster was really slow. I can follow why
logically your patch makes sense, so I agree that this is sane. Have you
spotted failures from the buildfarm? This is hard to spot for anyone
who just has access to what the buildfarm UI offers, so any
reference to failures on this thread would be welcome.
--
Michael


signature.asc
Description: PGP signature


Re: [HACKERS] Transactions involving multiple postgres foreign servers

2018-01-01 Thread Ashutosh Bapat
On Thu, Dec 28, 2017 at 11:08 AM, Masahiko Sawada  wrote:
>
>> (1)
>> Why don't you use the existing global variable MyXactFlags instead of the 
>> new TransactionDidWrite?  Or, how about using XactLastRecEnd != 0 to 
>> determine the transaction did any writes?  When the transaction only 
>> modified temporary tables on the local database and some data on one remote 
>> database, I think 2pc is unnecessary.
>
> Perhaps we can use (XactLastRecEnd != 0 && markXidCommitted) to see if
> we did any writes on local node which requires the atomic commit. Will
> fix.
>

I haven't checked how much code it needs to track whether the local
transaction wrote anything. But probably we can postpone this
optimization. If it's easy to incorporate, it's good to have in the
first set itself.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company