Re: [HACKERS] snapbuild woes

2016-12-10 Thread Craig Ringer
On 11 Dec. 2016 06:50, "Petr Jelinek"  wrote:

On 10/12/16 23:10, Petr Jelinek wrote:
>
> The attached 0002-Skip-unnecessary-snapshot-builds.patch changes this
> behavior so that we don't make snapshots for transactions that we have seen
> wholly and know didn't make catalog changes, even if we haven't reached
> consistency yet.
>

Eh, attached wrong patch. This one is correct.


It looks like no patch was attached the second time?


Re: [HACKERS] Back-patch use of unnamed POSIX semaphores for Linux?

2016-12-10 Thread Craig Ringer
On 11 Dec. 2016 07:44, "Tom Lane"  wrote:


I think we need to do at least this much for v10, because otherwise
we'll face ABI issues if an extension is compiled against code with
one semaphore API choice and used with code with a different one.


+1, this is a good idea. Your performance comments make sense too.


Re: [HACKERS] [sqlsmith] Short reads in hash indexes

2016-12-10 Thread Amit Kapila
On Thu, Dec 8, 2016 at 2:38 AM, Andreas Seltenreich  wrote:
> Andreas Seltenreich writes:
>
>> Amit Kapila writes:
>>
>>> On Sat, Dec 3, 2016 at 3:44 PM, Andreas Seltenreich  
>>> wrote:
 Amit Kapila writes:

> [2. text/x-diff; fix_hash_bucketsplit_sqlsmith_v1.patch]
 Ok, I'll do testing with the patch applied.
>>
>> Good news: the assertion hasn't fired since the patch is in.
>
> Meh, it fired again today after being silent for 100e6 queries :-/
> I guess I need to add some confidence qualification on such statements.
> Maybe sigmas as they do at CERN…
>

This assertion failure can be reproduced with Jeff's test as well, and the
fix for it is posted at [1].

>> smith=# select * from state_report where sqlstate = 'XX001';
>> -[ RECORD 1 ]--------------------------------------------------------------
>> count    | 10
>> sqlstate | XX001
>> sample   | ERROR:  could not read block 1173 in file "base/16384/17256": read only 0 of 8192 bytes
>> hosts    | {airbisquit,frell,gorgo,marbit,pillcrow,quakken}
>>
>>> Hmm, I am not sure if this is related to previous problem, but it
>>> could be.  Is it possible to get the operation and/or callstack for
>>> the above failure?
>>
>> Ok, will turn the elog into an assertion to get at the backtraces.
>
> Doing so on top of 4212cb7, I caught the backtrace below.  Query was:
>
> --8<---cut here---start->8---
> set max_parallel_workers_per_gather = 0;
> select  count(1) from
>public.hash_name_heap as ref_2
>join public.rtest_emplog as sample_1
>   on (ref_2.random = sample_1.who);
> --8<---cut here---end--->8---
>
> I've put the data directory where it can be reproduced here:
>
> http://ansel.ydns.eu/~andreas/hash_index_short_read.tar.xz (12MB)
>

This can happen because the buffer was not marked dirty: the index page
from which we deleted tuples never gets flushed, whereas vacuum has already
removed the corresponding heap tuples.  The next access to that hash index
page brings back the old copy, which still contains the tuples vacuum was
supposed to delete.  Following those stale index tuples gives wrong
information about the heap, and when we then try to access the
already-deleted heap tuples, it can produce the short-reads problem.
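
As a rough sketch of the fix pattern (illustrative only, not the actual
patch; buf, page, deletable and ndeletable are assumed local variables),
the page modification needs a MarkBufferDirty() inside the critical section:

    START_CRIT_SECTION();

    /* remove the index tuples that point to vacuumed heap tuples */
    PageIndexMultiDelete(page, deletable, ndeletable);

    /*
     * Without this call the buffer can be evicted unwritten, and a later
     * read brings back the stale on-disk copy of the page.
     */
    MarkBufferDirty(buf);

    /* WAL-log the change here, as the surrounding code normally does */

    END_CRIT_SECTION();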

Can you please try with the patch posted on hash index thread [1] to
see if you can reproduce any of these problems?

[1] - 
https://www.postgresql.org/message-id/CAA4eK1Kf6tOY0oVz_SEdngiNFkeXrA3xUSDPPORQvsWVPdKqnA%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Hash Indexes

2016-12-10 Thread Amit Kapila
On Wed, Dec 7, 2016 at 2:02 AM, Jeff Janes  wrote:
> On Tue, Dec 6, 2016 at 4:00 AM, Amit Kapila  wrote:
>>
>> On Tue, Dec 6, 2016 at 1:29 PM, Jeff Janes  wrote:
>> >
>> >
>> > I just occasionally insert a bunch of equal tuples, which have to be in
>> > overflow pages no matter how much splitting happens.
>> >
>> > I am getting vacuum errors against HEAD, after about 20 minutes or so (8
>> > cores).
>> >
>> > 49233  XX002 2016-12-05 23:06:44.087 PST:ERROR:  index "foo_index_idx"
>> > contains unexpected zero page at block 64941
>> > 49233  XX002 2016-12-05 23:06:44.087 PST:HINT:  Please REINDEX it.
>> > 49233  XX002 2016-12-05 23:06:44.087 PST:CONTEXT:  automatic vacuum of
>> > table
>> > "jjanes.public.foo"
>> >
>>
>> Thanks for the report.  This can happen due to vacuum trying to access
>> a new page.  Vacuum can do so if (a) the calculation for maxbuckets
>> (in metapage) is wrong or (b) it is trying to free the overflow page
>> twice.  Offhand, I don't see how that can happen in the code.  I will
>> investigate further to see if there is any other reason why vacuum
>> can access the new page.  BTW, have you done the test after commit
>> 2f4193c3?  That doesn't appear to be the cause of this problem, but
>> still, it is better to test after that fix.  I am trying to reproduce
>> the issue, but in the meantime, if by any chance you have a callstack,
>> please share it.
>
>
> It looks like I compiled the code for testing a few minutes before 2f4193c3
> went in.
>
> I've run it again with cb9dcbc1eebd8, after promoting the ERROR in
> _hash_checkpage, hashutil.c:174 to a PANIC so that I can get a coredump to
> feed to gdb.
>
> This time it was an INSERT, not autovac, that got the error:
>

The reason for this, and for the similar error in vacuum, was that in one
of the corner cases, after freeing the overflow page and updating the link
for the previous bucket, we were not marking the buffer as dirty.  So, due
to concurrent activity, the buffer containing the updated link got
evicted, and when we later tried to access the same buffer, it brought
back the old copy, which still contains a link to the freed overflow page.

Apart from the above issue, Kuntal noticed an assertion failure
(Assert(bucket == new_bucket);) in hashbucketcleanup with the same test
you provided.  The reason for that problem was that after deleting tuples
in hashbucketcleanup, we were not marking the buffer as dirty, due to
which the old copy of the overflow page was re-appearing and causing the
failure.

After fixing the above problem, we noticed another assertion failure
(Assert(bucket == obucket);) in _hash_splitbucket_guts.  The reason for
this problem was that after a split, vacuum failed to remove tuples from
the old bucket that had been moved due to the split.  During the next
split from the same old bucket, we don't expect the old bucket to contain
tuples from the previous split.  To fix this, whenever vacuum needs to
perform split cleanup, it should update the metapage values (the masks
required to calculate the bucket number) so that it doesn't miss cleaning
those tuples.  I believe this is the same assertion that Andreas reported
in another thread [1].

The next problem we encountered is that after running the same test
somewhat longer, inserts were failing with the error "unexpected zero
page at block ..".  After some analysis, I found that the lock chain
(lock the next overflow bucket page before releasing the previous one)
was broken in one corner case in _hash_freeovflpage, due to which an
insert got ahead of the squeeze-bucket operation and accessed the freed
overflow page before the link to it had been updated.
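
(Schematically, the "lock chain" rule is just lock coupling while walking
the overflow chain; the variable names below are placeholders rather than
the actual _hash_freeovflpage() code:)

    /* take the next overflow page before letting go of the current one */
    nextbuf = _hash_getbuf(rel, nextblkno, HASH_WRITE, LH_OVERFLOW_PAGE);

    /* only now is it safe to release the previous page */
    _hash_relbuf(rel, prevbuf);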

With the above fixes, the test ran successfully for more than a day.

Many thanks to Kuntal and Dilip for helping me in analyzing and
testing the fixes for these problems.

[1] - https://www.postgresql.org/message-id/87y3zrzbu5.fsf_-_%40ansel.ydns.eu

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


fix_hashindex_issues_v1.patch
Description: Binary data



Re: [HACKERS] proposal: psql statements \gstore \gstore_binary (instead COPY RAW)

2016-12-10 Thread Pavel Stehule
2016-12-10 7:11 GMT+01:00 Pavel Stehule :

>
>
> 2016-12-10 2:27 GMT+01:00 Jim Nasby :
>
>> On 12/9/16 9:39 AM, Pavel Stehule wrote:
>>
>>>
>>> SELECT image FROM accounts WHERE id = xxx
>>> \gstore_binary ~/image.png
>>>
>>> What do you think about this proposal?
>>>
>>
>> Seems reasonable.
>>
>> I've lost track at this point... is there a way to go the other direction
>> with that as well? Namely, stick the contents of a file into a field via an
>> INSERT or UPDATE?
>>
>
> The target of this feature is storing only. For import there should be
> other statements.
>
> I think there is a consensus (with Tom) on binary passing mode. Something
> like
>
> \set USE_BINARY on
>

I was wrong - the agreement is on passing psql variables as query
parameters, not on binary mode. Binary mode can be interesting for
importing XML, but it is really a corner case.
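
For reference, a workaround that already works today with plain text
variables, assuming a GNU coreutils base64 in the shell; it avoids binary
mode entirely at the cost of more verbose logs:

\set img `base64 -w0 ~/image.png`
INSERT INTO tab(id, data) VALUES (1, decode(:'img', 'base64'));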


>
> What is not clear (where there is not yet agreement) is the way to get the
> content - if we use variables holding the content (not references), then we
> may or may not need a special statement.
>
> So here are some ways to push binary content to the server:
>
> A)
> \set  `cat file`
> \set USE_BINARY on
> INSERT INTO tab(id, data) VALUES(1, :::bytea);
>
> B)
> \set  `cat file`
> INSERT INTO tab(id, data) VALUES (1, :x''); -- use bytea escape
>
> C)
> \load_binary  file
> INSERT INTO tab(id, data) VALUES(1, :'');
>
> D)
> \load  file
> \set USE_BINARY on
> INSERT INTO tab(id, data) VALUES(1, :::bytea);
>
> E)
> \set  ``cat file``
> INSERT INTO tab(id, data) VALUES (1, :'');
>
> Each of the mentioned variants has some advantages - I don't see a clear
> winner. I like binary mode for passing - the patch is small and clean, and
> possible errors are readable (not megabytes of hex numbers). Passing in
> text mode is safe - although some errors and logs can get crazy. I would
> like to use some form of "load" backslash command ("load", "load_binary"):
> a) we can implement file tab completion, b) we can hide platform specifics
> ("cat" on Linux, "type" on Windows).
>
> Now, only text variables are supported - that is enough for passing XML and
> JSON, but not for binary data (one important variant is passing XML as
> binary for automatic internal XML encoding transformation). So we either
> have to encode the content before storing it in a variable, or we have to
> introduce binary variables. It is not hard - introduce new functions; the
> current API will keep supporting text variables.
>
> The implementation of these variants is short and simple - we can implement
> more than one of them. @E is general, but a little bit magic and without an
> autocomplete possibility; @C is very clear.
>
> The discussion can be about the importance of the following features:
>
> 1. binary passing (important for XML; doesn't fill the logs; speed is not
> important in this context)
> 2. tab complete support
> 3. verbosity, readability
>
> I would like to know how important and interesting these points are for
> other people.  That can help with choosing the variant or variants we
> implement.  I don't expect significant differences in implementation
> complexity among the mentioned variants - the code changes will be +/- the
> same.
>
> Regards
>
> Pavel
>
>
>
>>
>> I've done that in the past via psql -v var=`cat file`, but there's
>> obviously some significant drawbacks to that...
>> --
>> Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
>> Experts in Analytics, Data Architecture and PostgreSQL
>> Data in Trouble? Get it in Treble! http://BlueTreble.com
>> 855-TREBLE2 (855-873-2532)
>>
>
>


Re: [HACKERS] Proposal : Parallel Merge Join

2016-12-10 Thread Dilip Kumar
On Sun, Dec 11, 2016 at 8:14 AM, Peter Geoghegan  wrote:
> I noticed that the partially parallelized Merge Join EXPLAIN ANALYZE
> that you attach has a big misestimation:
>
> ->  Merge Join  (cost=3405052.45..3676948.66 rows=320 width=32)
> (actual time=21165.849..37814.551 rows=1357812 loops=4)
>
> Is this the best test case to show off the patch? This node is the
> immediate outer child of a Nested Loop Semi Join, and so I'm concerned
> that we're measuring the wrong thing.

Actually, the main purpose of showing this example is that by enabling
parallelism with merge join we can enable parallelism in the complete
tree, and we can get a completely different kind of plan which is much
cheaper.

But I completely agree with your point that in this particular example
the estimation is wrong, and maybe with a correct estimation the plan
would be entirely different. I am planning to test this with more
self-generated test cases where we can first guarantee that the
estimation is almost correct and then see the impact of the patch.

Here is one such example, this time I tested my patch with Amit's
parallel index scan patch [1], just to enable more scope for parallel
merge join.

All configuration and machine details are the same as what I mentioned upthread.

create table test1(a, b) as select generate_series(1,
1000),generate_series(1, 1000);
create table test2(a, b) as select generate_series(1,
1000),generate_series(1, 1000);
create index idx1 on test1(a);
create index idx2 on test2(a);

Query:
explain analyze select * from test1, test2 where test1.a=test2.a and
test1.b = test2.b and test1.a+test2.b < 3 and test1.a < 300 and
test2.a < 300;

On Head:

 Merge Join  (cost=6.92..228073.90 rows=1 width=16) (actual
time=0.033..3123.400 rows=1 loops=1)
   Merge Cond: (test1.a = test2.a)
   Join Filter: ((test1.b = test2.b) AND ((test1.a + test2.b) < 3))
   Rows Removed by Join Filter: 298
   ->  Index Scan using idx1 on test1  (cost=0.43..98626.90
rows=2998198 width=8) (actual time=0.013..746.382 rows=299
loops=1)
 Index Cond: (a < 300)
   ->  Index Scan using idx2 on test2  (cost=0.43..98708.62
rows=3000639 width=8) (actual time=0.012..751.558 rows=299
loops=1)
 Index Cond: (a < 300)
 Planning time: 0.604 ms
 Execution time: 3123.442 ms

With Patch:

 Gather  (cost=1005.46..193001.64 rows=1 width=16) (actual
time=0.244..1883.087 rows=1 loops=1)
   Workers Planned: 3
   Workers Launched: 3
   ->  Merge Join  (cost=5.46..192000.64 rows=1 width=16) (actual
time=1409.686..1880.394 rows=0 loops=4)
 Merge Cond: (test2.a = test1.a)
 Join Filter: ((test1.b = test2.b) AND ((test1.a + test2.b) < 3))
 Rows Removed by Join Filter: 75
 ->  Parallel Index Scan using idx2 on test2
(cost=0.43..78381.71 rows=967948 width=8) (actual time=0.023..221.172
rows=75 loops=4)
   Index Cond: (a < 300)
 ->  Index Scan using idx1 on test1  (cost=0.43..98626.90
rows=2998198 width=8) (actual time=0.024..893.081 rows=2999254
loops=4)
   Index Cond: (a < 300)
 Planning time: 0.281 ms
 Execution time: 1888.750 ms

This example shows a direct improvement with merge join, but IMHO the
main value of this patch is that we can parallelize a multi-level join
query where parallelism is currently not used because of a merge join at
some level.

[1] 
https://www.postgresql.org/message-id/CAA4eK1KA4LwTYXZG%3Dk3J2bA%2BZOEYnz2gkyWBKgY%3D_q0xJRBMDw%40mail.gmail.com


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Patch to implement pg_current_logfile() function

2016-12-10 Thread Karl O. Pinc
On Sat, 10 Dec 2016 19:41:21 -0600
"Karl O. Pinc"  wrote:

> On Fri, 9 Dec 2016 23:36:12 -0600
> "Karl O. Pinc"  wrote:
> 
> > Instead I propose (code I have not actually executed):
> > ...
> > charlbuffer[MAXPGPATH];
> > char*log_format = lbuffer;
> > ...
> > 
> > /* extract log format and log file path from the line */
> > log_filepath = strchr(lbuffer, ' ');  /* lbuffer == log_format
> > log_filepath = strchr(lbuffer, ' ');  /* lbuffer == log_format */
> > *log_filepath = '\0';                 /* terminate log_format */
> > log_filepath[strcspn(log_filepath, "\n")] = '\0';  
> 
> Er, I guess I prefer the more paranoid, just because who knows
> what might have managed to somehow write the file that's read
> into lbuffer:
> 
> ...
> charlbuffer[MAXPGPATH];
> char*log_format = lbuffer;
> ...
> 
> /* extract log format and log file path from the line */
> if (log_filepath = strchr(lbuffer, ' '))  /* lbuffer == log_format */
>     *log_filepath = '\0';                 /* terminate log_format */
> log_filepath++;  /* start of file path */
> log_filepath[strcspn(log_filepath, "\n")] = '\0';

*sigh*


...
charlbuffer[MAXPGPATH];
char*log_format = lbuffer;
...

/* extract log format and log file path from the line */
/* lbuffer == log_format, they share storage */
if ((log_filepath = strchr(lbuffer, ' ')) != NULL)
*log_filepath = '\0';   /* terminate log_format */
else
{
/* Unknown format, no space. Return NULL to caller. */
lbuffer[0] = '\0';
break;
}
log_filepath++;  /* start of file path   */
log_filepath[strcspn(log_filepath, "\n")] = '\0';


> The file read is, of course, normally written by postgres.  But
> possibly writing to unintended memory locations, even virtual address
> NULL, does not seem good.
> 
> Any feedback from more experienced PG developers as how to best handle
> this case would be welcome.
> 
> Regards,
> 
> Karl 
> Free Software:  "You don't pay back, you pay forward."
>  -- Robert A. Heinlein
> 




Karl 
Free Software:  "You don't pay back, you pay forward."
 -- Robert A. Heinlein




Re: [HACKERS] Proposal : Parallel Merge Join

2016-12-10 Thread Peter Geoghegan
On Sat, Dec 10, 2016 at 4:59 AM, Dilip Kumar  wrote:
> 3. 20_patch.out  : Explain analyze output with patch (median of 3 runs)

I noticed that the partially parallelized Merge Join EXPLAIN ANALYZE
that you attach has a big misestimation:

->  Merge Join  (cost=3405052.45..3676948.66 rows=320 width=32)
(actual time=21165.849..37814.551 rows=1357812 loops=4)

Is this the best test case to show off the patch? This node is the
immediate outer child of a Nested Loop Semi Join, and so I'm concerned
that we measuring the wrong thing.

-- 
Peter Geoghegan




Re: [HACKERS] Patch to implement pg_current_logfile() function

2016-12-10 Thread Karl O. Pinc
On Fri, 9 Dec 2016 23:36:12 -0600
"Karl O. Pinc"  wrote:

> Instead I propose (code I have not actually executed):
> ...
> charlbuffer[MAXPGPATH];
> char*log_format = lbuffer;
> ...
> 
> /* extract log format and log file path from the line */
> log_filepath = strchr(lbuffer, ' ');  /* lbuffer == log_format */
> *log_filepath = '\0'; /* terminate log_format */
> log_filepath++;   /* start of file path */
> log_filepath[strcspn(log_filepath, "\n")] = '\0';

Er, I guess I prefer the more paranoid, just because who knows
what might have managed to somehow write the file that's read
into lbuffer:

...
charlbuffer[MAXPGPATH];
char*log_format = lbuffer;
...

/* extract log format and log file path from the line */
if (log_filepath = strchr(lbuffer, ' ')) /* lbuffer == log_format */
    *log_filepath = '\0';    /* terminate log_format */
log_filepath++;  /* start of file path */
log_filepath[strcspn(log_filepath, "\n")] = '\0';

The file read is, of course, normally written by postgres.  But possibly
writing to unintended memory locations, even virtual address NULL, does
not seem good.

Any feedback from more experienced PG developers as how to best handle
this case would be welcome.

Regards,

Karl 
Free Software:  "You don't pay back, you pay forward."
 -- Robert A. Heinlein




Re: [HACKERS] Declarative partitioning - another take

2016-12-10 Thread Venkata B Nagothi
Regards,

Venkata B N
Database Consultant


On Fri, Dec 9, 2016 at 11:11 PM, Amit Langote 
wrote:

> On Fri, Dec 9, 2016 at 3:16 PM, Venkata B Nagothi 
> wrote:
> > Hi,
> >
> > I am testing the partitioning feature from the latest master and got the
> > following error while loading the data -
> >
> > db01=# create table orders_y1993 PARTITION OF orders FOR VALUES FROM
> > ('1993-01-01') TO ('1993-12-31');
> > CREATE TABLE
> >
> > db01=# copy orders from '/data/orders-1993.csv' delimiter '|';
> > ERROR:  could not read block 6060 in file "base/16384/16412": read only
> 0 of
> > 8192 bytes
> > CONTEXT:  COPY orders, line 376589:
> > "9876391|374509|O|54847|1997-07-16|3-MEDIUM
>  |Clerk#01993|0|ithely
> > regular pack"
>
> Hmm.   Could you tell what relation the file/relfilenode 16412 belongs to?
>

db01=# select relname from pg_class where relfilenode=16412 ;
   relname
--
 orders_y1997
(1 row)


I VACUUMed the partition and then re-ran the COPY command, but no luck.

db01=# vacuum orders_y1997;
VACUUM

db01=# copy orders from '/data/orders-1993.csv' delimiter '|';
ERROR:  could not read block 6060 in file "base/16384/16412": read only 0
of 8192 bytes
CONTEXT:  COPY orders, line 376589:
"9876391|374509|O|54847|1997-07-16|3-MEDIUM   |Clerk#01993|0|ithely
regular pack"

I do not quite understand the behaviour below either. I VACUUMed the 1997
partition and then got an error for the 1992 partition, then for 1996,
then for 1994, and so on.

postgres=# \c db01
You are now connected to database "db01" as user "dba".
db01=# copy orders from '/data/orders-1993.csv' delimiter '|';
ERROR:  could not read block 6060 in file "base/16384/16412": read only 0
of 8192 bytes
CONTEXT:  COPY orders, line 376589:
"9876391|374509|O|54847|1997-07-16|3-MEDIUM   |Clerk#01993|0|ithely
regular pack"
db01=# vacuum orders_y1997;
VACUUM
db01=# copy orders from '/data/orders-1993.csv' delimiter '|';
ERROR:  could not read block 3942 in file "base/16384/16406": read only 0
of 8192 bytes
CONTEXT:  COPY orders, line 75445:
"1993510|185287|F|42667.9|1992-08-15|2-HIGH |Clerk#00079|0|
dugouts above the even "
db01=# select relname from pg_class where relfilenode=16406;
   relname
--
 orders_y1992
(1 row)

db01=# copy orders from '/data/orders-1993.csv' delimiter '|';
ERROR:  could not read block 3942 in file "base/16384/16406": read only 0
of 8192 bytes
CONTEXT:  COPY orders, line 75396:
"1993317|260510|F|165852|1992-12-13|5-LOW
 |Clerk#03023|0|regular foxes. ironic dependenc..."
db01=# vacuum orders_y1992;
VACUUM
db01=# copy orders from '/data/orders-1993.csv' delimiter '|';
ERROR:  could not read block 3708 in file "base/16384/16394": read only 0
of 8192 bytes
CONTEXT:  COPY orders, line 178820:
"4713957|286270|O|200492|1996-10-01|1-URGENT
|Clerk#01993|0|uriously final packages. slyly "
db01=# select relname from pg_class where relfilenode=16394;
   relname
--
 orders_y1996
(1 row)

db01=# vacuum orders_y1996;
VACUUM
db01=# copy orders from '/data/orders-1993.csv' delimiter '|';
ERROR:  could not read block 5602 in file "base/16384/16403": read only 0
of 8192 bytes
CONTEXT:  COPY orders, line 147390:
"3882662|738010|F|199365|1994-12-26|5-LOW  |Clerk#01305|0|ar
instructions above the expre..."
db01=# select relname from pg_class where relfilenode=16403;
   relname
--
 orders_y1994
(1 row)

db01=# vacuum orders_y1994;
VACUUM
db01=# copy orders from '/data/orders-1993.csv' delimiter '|';
ERROR:  could not read block 5561 in file "base/16384/16412": read only 0
of 8192 bytes
CONTEXT:  COPY orders, line 59276:
"1572448|646948|O|25658.6|1997-05-02|4-NOT SPECIFIED|Clerk#01993|0|es.
ironic, regular p"


*And finally the error occurred again for the 1997 partition*

db01=# select relname from pg_class where relfilenode=16412;
   relname
--
 orders_y1997
(1 row)

db01=# vacuum orders_y1997;
VACUUM
db01=# copy orders from '/data/orders-1993.csv' delimiter '|';
ERROR:  could not read block 6060 in file "base/16384/16412": read only 0
of 8192 bytes
CONTEXT:  COPY orders, line 376589:
"9876391|374509|O|54847|1997-07-16|3-MEDIUM   |Clerk#01993|0|ithely
regular pack"
db01=#

Am I misunderstanding something here?


> Also, is orders_y1993 the only partition of orders?  How about \d+ orders?
>

Yes, I created multiple yearly partitions for the orders table. I wanted to
load the 1993 data first and see whether it goes into the orders_y1993
partition, and it seems the CSV contains 1997 data as well, which the COPY
command found.
db01=# \d+ orders
  Table "public.orders"
 Column     | Type    | Collation | Nullable | Default | Storage | Stats target | Description
------------+---------+-----------+----------+---------+---------+--------------+-------------
 o_orderkey | integer |           |          |         |         |              |

Re: [HACKERS] Effect of caching hash bucket size while costing

2016-12-10 Thread Tom Lane
Robert Haas  writes:
> On Thu, Dec 8, 2016 at 3:53 AM, Srinivas Karthik V
>  wrote:
>> 1) Can you please let me know if innerbucketsize*innerpathrows captures the
>> maximum bucket size?
>> 2) why is it not calculated afresh all the time?

> Well, #2 is answered there right in the comments:

>  * Since we tend to visit the same clauses over and over when
>  * planning a large query, we cache the bucketsize estimate in the
>  * RestrictInfo node to avoid repeated lookups of statistics.

> I assume the person who wrote the comment thought that the answer
> wouldn't change from one call to the next, and therefore it was safe
> to cache.  I don't know why that isn't the case for you.

That was me.  AFAICS, the only way this could change is if virtualbuckets
changes, which would require the results of ExecChooseHashTableSize to
change, which probably means inner_path_rows changed.  So I suspect this
got broken by the introduction of parameterized paths; but there's not
enough info here to confirm whether we're dealing with a parameterized
path or not.

If that is it, I wonder whether we could redefine the cached value so
that it doesn't depend on virtualbuckets.  If not, we could fall back
to only using the cache for nonparameterized inner paths.
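
(A rough sketch of that fallback, reusing the variable names from the
snippet quoted upthread; illustrative only, not a committed change:)

    if (restrictinfo->left_bucketsize >= 0 && inner_path->param_info == NULL)
        thisbucketsize = restrictinfo->left_bucketsize;        /* cached */
    else
        thisbucketsize = estimate_hash_bucketsize(root,
                                                  get_leftop(restrictinfo->clause),
                                                  virtualbuckets);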

regards, tom lane




Re: [HACKERS] References to arbitrary database objects that are suitable for pg_dump

2016-12-10 Thread Jim Nasby

On 11/26/16 12:30 AM, Jim Nasby wrote:

On 11/25/16 6:00 PM, Tom Lane wrote:

OIDs?  Then use pg_describe_object() to turn that into text when dumping.


How would I get pg_dump to do that?

I could certainly make this work if there was some way to customize how
pg_dump dumps things, but AFAIK there's no way to do that, even with
extensions.


Is there anything that will translate an object identity (as returned by 
pg_event_trigger_ddl_commands) into class/object/subobject OIDs? If I 
had that, I could use it to track objects by OID (like pg_depend does), 
as well as storing the text representation. That way I can have an event 
trigger that updates the stored pg_describe_object() text any time an 
object was renamed. Without that, I don't see any good way to find the 
original name of a renamed object (so that I can update my reference to it).


ISTM it would be useful to add address_names and address_args to 
pg_event_trigger_ddl_commands(), as well as adding some way to get the 
original information for an object when an ALTER is being done. In 
particular, if something (such as a column) is being renamed.


BTW, why were pg_get_object_address and pg_identify_object_as_address 
added instead of creating a function that accepted the output of 
pg_identify_object()? Is there a reason that wouldn't work?
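
For anyone following along, a minimal example of the existing pair of
functions (the table name is hypothetical):

SELECT * FROM pg_identify_object_as_address('pg_class'::regclass,
                                            'some_table'::regclass, 0);
-- expected: (table, {public,some_table}, {})

SELECT * FROM pg_get_object_address('table', '{public,some_table}', '{}');
-- expected: (classid = 1259, objid = <oid of some_table>, objsubid = 0)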

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)




[HACKERS] jacana hung after failing to acquire random number

2016-12-10 Thread Andrew Dunstan


jacana (mingw, 64-bit compiler, no openssl) is currently hung on "make
check". After starting the autovacuum launcher there are 120 messages in
its log about "Could not acquire random number". Then nothing.



So I suspect the problem here is commit 
fe0a0b5993dfe24e4b3bcf52fa64ff41a444b8f1, although I haven't looked in 
detail.



Shouldn't we want the postmaster to shut down if it's not going to go 
further? Note that frogmouth, also mingw, which builds with openssl, 
doesn't have this issue.



cheers


andrew







Re: [HACKERS] Back-patch use of unnamed POSIX semaphores for Linux?

2016-12-10 Thread Tom Lane
Robert Haas  writes:
> On Wed, Dec 7, 2016 at 3:12 PM, Tom Lane  wrote:
>> As things stand, it's only a configure-time choice, but I've been
>> thinking that we might be well advised to make it run-time configurable.

> Sure.  A configure-time choice only benefits people who are compiling
> from source, which as far as production is concerned is almost nobody.

Attached is a proposed patch that unifies the ABI for the three Unix-y
types of semaphores.  I figured it wasn't worth trying to unify Windows
as well, since nobody would ever be doing run-time switching between
Windows and Unix code.

The practical impact of this is that the sem_t or semId/semNum data
is removed from the PGPROC array and placed in a separate array elsewhere
in shared memory.  On 64-bit Linux machines, sem_t is 64 bytes (or at
least, it is on my RHEL6 box), so this change undoes the 56-byte addition
that commit ecb0d20a9 caused.  I think this is probably a performance
win, even though it means an extra indirection to get at the sem_t,
because the speed of access to the sem_t really shouldn't matter: we
only touch that when we're putting a process to sleep or waking it up.
Not bloating PGPROC is probably worth something, though.
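
(Schematically, and with illustrative names rather than the exact ones in
the patch, the layout change amounts to this:)

    #include <semaphore.h>

    /* after the change, PGSemaphore is just an opaque pointer ... */
    typedef sem_t *PGSemaphore;

    /* ... and PGPROC holds only that pointer; the sem_t (or SysV
     * semId/semNum pair) lives in a separate array in shared memory */
    typedef struct PGPROC
    {
        PGSemaphore sem;    /* reference into the shared semaphore array */
        /* ... other PGPROC fields unchanged ... */
    } PGPROC;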

It would take additional work to get to being able to do run-time
switching between the Unix semaphore APIs, but that work would now be
localized in posix_sema.c and sysv_sema.c and would not affect code
elsewhere.

I think we need to do at least this much for v10, because otherwise
we'll face ABI issues if an extension is compiled against code with
one semaphore API choice and used with code with a different one.
That didn't use to be a problem because there was really no expectation
of anyone using a non-default semaphore API on any platform, but
I fear it will become an issue if we don't do this.

regards, tom lane

diff --git a/src/backend/port/posix_sema.c b/src/backend/port/posix_sema.c
index 2b4b11c..603dc5a 100644
*** a/src/backend/port/posix_sema.c
--- b/src/backend/port/posix_sema.c
***
*** 6,11 
--- 6,19 
   * We prefer the unnamed style of POSIX semaphore (the kind made with
   * sem_init).  We can cope with the kind made with sem_open, however.
   *
+  * In either implementation, typedef PGSemaphore is equivalent to "sem_t *".
+  * With unnamed semaphores, the sem_t structs live in an array in shared
+  * memory.  With named semaphores, that's not true because we cannot persuade
+  * sem_open to do its allocation there.  Therefore, the named-semaphore code
+  * *does not cope with EXEC_BACKEND*.  The sem_t structs will just be in the
+  * postmaster's private memory, where they are successfully inherited by
+  * forked backends, but they could not be accessed by exec'd backends.
+  *
   *
   * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
   * Portions Copyright (c) 1994, Regents of the University of California
***
*** 18,45 
  #include "postgres.h"
  
  #include 
  #include 
  #include 
  
  #include "miscadmin.h"
  #include "storage/ipc.h"
  #include "storage/pg_sema.h"
  
  
! #ifdef USE_NAMED_POSIX_SEMAPHORES
! /* PGSemaphore is pointer to pointer to sem_t */
! #define PG_SEM_REF(x)	(*(x))
! #else
! /* PGSemaphore is pointer to sem_t */
! #define PG_SEM_REF(x)	(x)
  #endif
  
  
  #define IPCProtection	(0600)	/* access/modify by user only */
  
  static sem_t **mySemPointers;	/* keep track of created semaphores */
  static int	numSems;			/* number of semas acquired so far */
! static int	maxSems;			/* allocated size of mySemaPointers array */
  static int	nextSemKey;			/* next name to try */
  
  
--- 26,63 
  #include "postgres.h"
  
  #include 
+ #include 
  #include 
  #include 
  
  #include "miscadmin.h"
  #include "storage/ipc.h"
  #include "storage/pg_sema.h"
+ #include "storage/shmem.h"
  
  
! /* see file header comment */
! #if defined(USE_NAMED_POSIX_SEMAPHORES) && defined(EXEC_BACKEND)
! #error cannot use named POSIX semaphores with EXEC_BACKEND
  #endif
  
+ /* typedef PGSemaphore is equivalent to pointer to sem_t */
+ typedef struct PGSemaphoreData
+ {
+ 	sem_t		pgsem;
+ } PGSemaphoreData;
+ 
+ #define PG_SEM_REF(x)	(&(x)->pgsem)
  
  #define IPCProtection	(0600)	/* access/modify by user only */
  
+ #ifdef USE_NAMED_POSIX_SEMAPHORES
  static sem_t **mySemPointers;	/* keep track of created semaphores */
+ #else
+ static PGSemaphore sharedSemas; /* array of PGSemaphoreData in shared memory */
+ #endif
  static int	numSems;			/* number of semas acquired so far */
! static int	maxSems;			/* allocated size of above arrays */
  static int	nextSemKey;			/* next name to try */
  
  
*** PosixSemaphoreKill(sem_t * sem)
*** 134,139 
--- 152,172 
  
  
  /*
+  * Report amount of shared memory needed for semaphores
+  */
+ Size
+ PGSemaphoreShmemSize(int maxSemas)
+ {
+ #ifdef USE_NAMED_POSIX_SEMAPHORES
+ 	/* No shared memory needed in 

Re: [HACKERS] [COMMITTERS] pgsql: Implement table partitioning.

2016-12-10 Thread Jim Nasby

On 12/10/16 1:02 PM, Christophe Pettus wrote:



On Dec 9, 2016, at 22:52, Keith Fiske  wrote:
On Fri, Dec 9, 2016 at 10:01 PM, Robert Haas  wrote:

One thing that's tricky/annoying about this is that if you have a
DEFAULT partition and then add a partition, you have to scan the
DEFAULT partition for data that should be moved to the new partition.
That makes what would otherwise be a quick operation slow.  Still, I'm
sure there's a market for that feature.


I would find that perfectly acceptable as long as a caveat about the 
performance impact was included in the documentation.


+1.  I don't think it's conceptually different from adding a column with a 
default, in that regard; you just have to know the impact.


FWIW, I can think of another option: always check the default partition 
for data, even if the data should only exist in a specific partition. If 
that proved to be too expensive in the normal case it could be optional.


Is it possible to manually (and incrementally) move data from the 
default partition to a table that will become the partition for that 
data and then do a fast cut-over once that's done? That would be akin to 
adding a field without a DEFAULT, adding the DEFAULT after that, and 
then slowly updating all the existing rows...

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)




Re: [HACKERS] snapbuild woes

2016-12-10 Thread Petr Jelinek
On 10/12/16 23:10, Petr Jelinek wrote:
> 
> The attached 0002-Skip-unnecessary-snapshot-builds.patch changes this
> behavior so that we don't make snapshots for transactions that we have seen
> wholly and know that they didn't make catalog changes, even if we haven't
> reached consistency yet.
>

Eh, attached wrong patch. This one is correct.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services
From 2add068ed38c2887e6652396c280695a7e384fe7 Mon Sep 17 00:00:00 2001
From: Petr Jelinek 
Date: Sat, 10 Dec 2016 22:22:13 +0100
Subject: [PATCH 2/2] Skip unnecessary snapshot builds

---
 src/backend/replication/logical/snapbuild.c | 82 +++--
 1 file changed, 53 insertions(+), 29 deletions(-)

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 5836d52..ea3f40f 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -955,6 +955,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 	bool		forced_timetravel = false;
 	bool		sub_needs_timetravel = false;
 	bool		top_needs_timetravel = false;
+	bool		skip_forced_snapshot = false;
 
 	TransactionId xmax = xid;
 
@@ -976,10 +977,19 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		/*
 		 * We could avoid treating !SnapBuildTxnIsRunning transactions as
 		 * timetravel ones, but we want to be able to export a snapshot when
-		 * we reached consistency.
+		 * we reached consistency so we need to keep track of them.
 		 */
 		forced_timetravel = true;
 		elog(DEBUG1, "forced to assume catalog changes for xid %u because it was running too early", xid);
+
+		/*
+		 * It is however desirable to skip building new snapshot for
+		 * !SnapBuildTxnIsRunning transactions as otherwise we might end up
+		 * building thousands of unused snapshots on busy servers which can
+		 * be very expensive.
+		 */
+		if (!SnapBuildTxnIsRunning(builder, xid))
+			skip_forced_snapshot = true;
 	}
 
 	for (nxact = 0; nxact < nsubxacts; nxact++)
@@ -992,21 +1002,10 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		SnapBuildEndTxn(builder, lsn, subxid);
 
 		/*
-		 * If we're forcing timetravel we also need visibility information
-		 * about subtransaction, so keep track of subtransaction's state.
-		 */
-		if (forced_timetravel)
-		{
-			SnapBuildAddCommittedTxn(builder, subxid);
-			if (NormalTransactionIdFollows(subxid, xmax))
-				xmax = subxid;
-		}
-
-		/*
 		 * Add subtransaction to base snapshot if it DDL, we don't distinguish
 		 * to toplevel transactions there.
 		 */
-		else if (ReorderBufferXidHasCatalogChanges(builder->reorder, subxid))
+		if (ReorderBufferXidHasCatalogChanges(builder->reorder, subxid))
 		{
 			sub_needs_timetravel = true;
 
@@ -1018,6 +1017,16 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 			if (NormalTransactionIdFollows(subxid, xmax))
+				xmax = subxid;
 		}
+		/*
+		 * If we're forcing timetravel we also need visibility information
+		 * about subtransaction, so keep track of subtransaction's state.
+		 */
+		else if (forced_timetravel)
+		{
+			SnapBuildAddCommittedTxn(builder, subxid);
+			if (NormalTransactionIdFollows(subxid, xmax))
+				xmax = subxid;
+		}
 	}
 
 	/*
@@ -1026,14 +1035,8 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 	 */
 	SnapBuildEndTxn(builder, lsn, xid);
 
-	if (forced_timetravel)
-	{
-		elog(DEBUG2, "forced transaction %u to do timetravel.", xid);
-
-		SnapBuildAddCommittedTxn(builder, xid);
-	}
 	/* add toplevel transaction to base snapshot */
-	else if (ReorderBufferXidHasCatalogChanges(builder->reorder, xid))
+	if (ReorderBufferXidHasCatalogChanges(builder->reorder, xid))
 	{
 		elog(DEBUG2, "found top level transaction %u, with catalog changes!",
 			 xid);
@@ -1046,10 +1049,18 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		/* mark toplevel txn as timetravel as well */
 		SnapBuildAddCommittedTxn(builder, xid);
 	}
+	else if (forced_timetravel)
+	{
+		elog(DEBUG2, "forced transaction %u to do timetravel.", xid);
+
+		SnapBuildAddCommittedTxn(builder, xid);
+	}
 
 	/* if there's any reason to build a historic snapshot, do so now */
 	if (forced_timetravel || top_needs_timetravel || sub_needs_timetravel)
 	{
+		bool build_snapshot;
+
 		/*
 		 * Adjust xmax of the snapshot builder, we only do that for committed,
 		 * catalog modifying, transactions, everything else isn't interesting
@@ -1070,14 +1081,29 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 			return;
 
 		/*
-		 * Decrease the snapshot builder's refcount of the old snapshot, note
-		 * that it still will be used if it has been handed out to the
-		 * reorderbuffer earlier.
+		 * Build snapshot if needed. We need to build it if there isn't one
+		 * 

[HACKERS] snapbuild woes

2016-12-10 Thread Petr Jelinek
Hi,

I recently found a couple of issues with the way the initial logical
decoding snapshot is made.

The first one is an outright bug, which has to do with how we track
running transactions.  What snapbuild basically does while building the
initial snapshot is read the xl_running_xacts record, store the list of
running transactions, and then wait until they all finish.  The problem is
that xl_running_xacts does not ensure that it only logs transactions that
are actually still running (to avoid locking PGPROC), so there may be xids
in xl_running_xacts that had already committed before it was logged.  This
in turn means that snapbuild will never reach a consistent snapshot if
that situation occurs.  The reason this hasn't bitten us much yet is that
snapbuild considers all transactions finished when an empty
xl_running_xacts arrives, which usually happens after a while (but might
not on a consistently busy server).

The fix I came up with is to extend the mechanism to also consider all
transactions finished when an xl_running_xacts record arrives whose xmin
is bigger than the previous xmax.  This seems to work pretty well on busy
servers.  This fix is attached as
0001-Mark-snapshot-consistent-when-all-running-txes-have.patch.  I believe
it should be backpatched all the way to 9.4.


The other issue is a performance problem, again on busy servers, with the
initial snapshot.  We track catalog-modifying transactions so that we can
do proper catalog time travel when decoding changes.  For transactions
that were already running when we started trying to get the initial
consistent snapshot, there is no good way to tell whether they made
catalog changes or not, so we treat them all as catalog-changing and make
a separate historical snapshot for every such transaction.  That by itself
is fine; the problem is that the current implementation also treats
transactions that started after we began watching for changes, but before
we reached the consistent state, as catalog-changing, even though for
those we actually do know whether they made any catalog change.  In
practice this means we build snapshots that are not really necessary, and
if there was a long-running transaction the snapshot builder had to wait
for, we can create thousands of unused snapshots, which affects
performance badly (I've seen the initial snapshot take an hour because of
this).

The attached 0002-Skip-unnecessary-snapshot-builds.patch changes this
behavior so that we don't make snapshots for transactions that we have seen
wholly and know didn't make catalog changes, even if we haven't reached
consistency yet.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services
From 67e902eef33da9e241c079eaed9ab12a88616296 Mon Sep 17 00:00:00 2001
From: Petr Jelinek 
Date: Sat, 10 Dec 2016 21:56:44 +0100
Subject: [PATCH 1/2] Mark snapshot consistent when all running txes have
 finished.

---
 src/backend/replication/logical/snapbuild.c | 29 +
 1 file changed, 21 insertions(+), 8 deletions(-)

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 8b59fc5..5836d52 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -1213,9 +1213,11 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 	 * Build catalog decoding snapshot incrementally using information about
 	 * the currently running transactions. There are several ways to do that:
 	 *
-	 * a) There were no running transactions when the xl_running_xacts record
-	 *	  was inserted, jump to CONSISTENT immediately. We might find such a
-	 *	  state we were waiting for b) and c).
+	 * a) Either there were no running transactions when the xl_running_xacts
+	 *record was inserted (we might find this while waiting for b) or c))
+	 *or the running transactions we've been tracking have all finished
+	 *by the time the xl_running_xacts was inserted. We can jump to
+	 *CONSISTENT immediately.
 	 *
 	 * b) Wait for all toplevel transactions that were running to end. We
 	 *	  simply track the number of in-progress toplevel transactions and
@@ -1251,13 +1253,18 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 	}
 
 	/*
-	 * a) No transaction were running, we can jump to consistent.
+	 * a) Either no transaction were running or we've been tracking running
+	 *transactions and the new snapshot has them all finished, we can
+	 *jump to consistent.
 	 *
-	 * NB: We might have already started to incrementally assemble a snapshot,
-	 * so we need to be careful to deal with that.
+	 * NB: Since we might have already started to incrementally assemble a
+	 * snapshot, we need to be careful to deal with that.
 	 */
-	if (running->xcnt == 0)
+	if (running->xcnt == 0 ||
+		(builder->running.xcnt > 0 &&
+		 running->oldestRunningXid > 

Re: [HACKERS] [sqlsmith] Crash in tsquery_rewrite/QTNBinary

2016-12-10 Thread Artur Zakirov
Hi,

2016-12-07 9:06 GMT+03:00 Andreas Seltenreich :
> Hi,
>
> the following query crashes master as of 4212cb7.
>
> select ts_rewrite(
>   tsquery_phrase(
>  tsquery $$'sanct' & 'peter'$$,
>  tsquery $$'5' <-> '6'$$,
>  42),
>   tsquery $$'5' <-> '6'$$,
>   plainto_tsquery('I') );
>

This happens also for queries:

select ts_rewrite(
  to_tsquery('5 & (6 | 5)'),
  to_tsquery('5'),
  to_tsquery('I'));

select ts_rewrite(
  to_tsquery('5 & 6 <-> 5'),
  to_tsquery('5'),
  to_tsquery('I'));

It happens because 'I' is a stop word and the substitute query becomes
empty.  For the queries above we need a recursive dropvoidsubtree()
function; without this patch that function cleans only the first level of
the tree, so the query above becomes '6 | void'.

At first I made dropvoidsubtree() recursive, but the attached patch
instead cleans the query tree in dofindsubquery() to avoid an extra tree
scan.

-- 
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
diff --git a/src/backend/utils/adt/tsquery_rewrite.c b/src/backend/utils/adt/tsquery_rewrite.c
index e5bed6e..00174bd 100644
--- a/src/backend/utils/adt/tsquery_rewrite.c
+++ b/src/backend/utils/adt/tsquery_rewrite.c
@@ -186,6 +186,15 @@ findeq(QTNode *node, QTNode *ex, QTNode *subs, bool *isfind)
  * Recursive guts of findsubquery(): attempt to replace "ex" with "subs"
  * at the root node, and if we failed to do so, recursively match against
  * child nodes.
+ *
+ * Delete any void subtrees. In the following example '5' is replaced by empty
+ * operand:
+ *
+ *    AND    ->    6
+ *   /   \
+ *  5   PHRASE
+ *       /  \
+ *      6    5
  */
 static QTNode *
 dofindsubquery(QTNode *root, QTNode *ex, QTNode *subs, bool *isfind)
@@ -200,39 +209,20 @@ dofindsubquery(QTNode *root, QTNode *ex, QTNode *subs, bool *isfind)
 
     if (root && (root->flags & QTN_NOCHANGE) == 0 && root->valnode->type == QI_OPR)
     {
-        int         i;
-
-        for (i = 0; i < root->nchild; i++)
-            root->child[i] = dofindsubquery(root->child[i], ex, subs, isfind);
-    }
-
-    return root;
-}
-
-/*
- * Delete any void subtrees that may have been inserted when the replacement
- * subtree is void.
- */
-static QTNode *
-dropvoidsubtree(QTNode *root)
-{
-    if (!root)
-        return NULL;
-
-    if (root->valnode->type == QI_OPR)
-    {
         int         i,
                     j = 0;
 
         for (i = 0; i < root->nchild; i++)
         {
-            if (root->child[i])
-            {
-                root->child[j] = root->child[i];
+            root->child[j] = dofindsubquery(root->child[i], ex, subs, isfind);
+            if (root->child[j])
                 j++;
-            }
         }
 
+        /*
+         * Delete any void subtrees that may have been inserted when the
+         * replacement subtree is void.
+         */
         root->nchild = j;
 
         if (root->nchild == 0)
@@ -267,9 +257,6 @@ findsubquery(QTNode *root, QTNode *ex, QTNode *subs, bool *isfind)
 
     root = dofindsubquery(root, ex, subs, );
 
-    if (!subs && DidFind)
-        root = dropvoidsubtree(root);
-
     if (isfind)
         *isfind = DidFind;
 
diff --git a/src/test/regress/expected/tsearch.out b/src/test/regress/expected/tsearch.out
index 55d6a85..9c20aaf 100644
--- a/src/test/regress/expected/tsearch.out
+++ b/src/test/regress/expected/tsearch.out
@@ -1251,6 +1251,14 @@ SELECT ts_rewrite('5 <-> (6 | 8)', 'SELECT keyword, sample FROM test_tsquery'::t
  '5' <-> '7' | '5' <-> '8'
 (1 row)
 
+-- Check empty substitute
+SELECT ts_rewrite(to_tsquery('5 & 6 <-> 5'), to_tsquery('5'), to_tsquery('I'));
+NOTICE:  text-search query contains only stop words or doesn't contain lexemes, ignored
+ ts_rewrite 
+------------
+ '6'
+(1 row)
+
 SELECT keyword FROM test_tsquery WHERE keyword @> 'new';
 keyword 
 
diff --git a/src/test/regress/sql/tsearch.sql b/src/test/regress/sql/tsearch.sql
index afd990e..8752344 100644
--- a/src/test/regress/sql/tsearch.sql
+++ b/src/test/regress/sql/tsearch.sql
@@ -418,6 +418,8 @@ SELECT ts_rewrite('1 & (2 <2> 3)', 'SELECT keyword, sample FROM test_tsquery'::t
 SELECT ts_rewrite('5 <-> (1 & (2 <-> 3))', 'SELECT keyword, sample FROM test_tsquery'::text );
 SELECT ts_rewrite('5 <-> (6 | 8)', 'SELECT keyword, sample FROM test_tsquery'::text );
 
+-- Check empty substitute
+SELECT ts_rewrite(to_tsquery('5 & 6 <-> 5'), to_tsquery('5'), to_tsquery('I'));
 
 SELECT keyword FROM test_tsquery WHERE keyword @> 'new';
 SELECT keyword FROM test_tsquery WHERE keyword @> 'moscow';


Re: [HACKERS] [COMMITTERS] pgsql: Implement table partitioning.

2016-12-10 Thread Christophe Pettus

> On Dec 9, 2016, at 22:52, Keith Fiske  wrote:
> On Fri, Dec 9, 2016 at 10:01 PM, Robert Haas  wrote:
>> One thing that's tricky/annoying about this is that if you have a
>> DEFAULT partition and then add a partition, you have to scan the
>> DEFAULT partition for data that should be moved to the new partition.
>> That makes what would otherwise be a quick operation slow.  Still, I'm
>> sure there's a market for that feature.
> 
> I would find that perfectly acceptable as long as a caveat about the 
> performance impact was included in the documentation.

+1.  I don't think it's conceptually different from adding a column with a 
default, in that regard; you just have to know the impact.



Re: [HACKERS] Effect of caching hash bucket size while costing

2016-12-10 Thread Robert Haas
On Thu, Dec 8, 2016 at 3:53 AM, Srinivas Karthik V
 wrote:
> Dear PostgreSQL Hackers,
>
> I am working in PostgreSQL 9.4.* optimizer module. In costsize.c file and
> final_cost_hashjoin() function, the innerbucketsize is either:
>
> a) calculated using a cached copy
>   OR
> b)  calculated afresh using statistics captured by the following code
> snippet:
> thisbucketsize = estimate_hash_bucketsize(root,
>                      get_leftop(restrictinfo->clause), virtualbuckets);
>
> For the query I used, if I disable the caching for calculating the
> innerbucketsize, I get a different plan with cost change of around 1000
> units.
>
> 1) Can you please let me know if innerbucketsize*innerpathrows captures the
> maximum bucket size?
> 2) why is it not calculated afresh all the time?

Well, #2 is answered there right in the comments:

 * Since we tend to visit the same clauses over and over when
 * planning a large query, we cache the bucketsize estimate in the
 * RestrictInfo node to avoid repeated lookups of statistics.

I assume the person who wrote the comment thought that the answer
wouldn't change from one call to the next, and therefore it was safe
to cache.  I don't know why that isn't the case for you.

As to question #1, there's a comment for that, too, a little further down:

 * The number of tuple comparisons needed is the number of outer
 * tuples times the typical number of tuples in a hash bucket, which
 * is the inner relation size times its bucketsize fraction.  At each
 * one, we need to evaluate the hashjoin quals.  But actually,

So innerbucketsize*innerpathrows represents the expected number of
comparisons that we expect to need to perform per hash probe.
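
(For instance, with an inner relation of 1,000,000 rows and an estimated
bucketsize fraction of 0.0001, each outer tuple is expected to be compared
against roughly 1,000,000 * 0.0001 = 100 inner tuples per probe.)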

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] postgres_fdw bug in 9.6

2016-12-10 Thread Jeff Janes
On Thu, Dec 8, 2016 at 10:28 AM, Tom Lane  wrote:

> Robert Haas  writes:
> > Maybe it would help for Jeff to use elog_node_display() to the nodes
> > that are causing the problem - e.g. outerpathkeys and innerpathkeys
> > and best_path->path_mergeclauses, or just best_path - at the point
> > where the error is thrown. That might give us enough information to
> > see what's broken.
>
> I'll be astonished if that's sufficient evidence.  We already know that
> the problem is that the input path doesn't claim to be sorted in a way
> that would match the merge clauses, but that doesn't tell us how such
> a path came to be generated (or, if it wasn't intentionally done, where
> the data structure got clobbered later).
>
> It's possible that setting a breakpoint at create_mergejoin_path and
> capturing stack traces for all calls would yield usable insight.  But
> there are likely to be lots of calls if this is an 8-way join query,
> and probably only a few are wrong.
>
> I'd much rather have a test case than try to debug this remotely.
> Bandwidth too low.
>

I didn't get any asserts on the assert-enabled build.

I have a test case where I made the fdw connect back to itself, and
stripped out all the objects I could while still reproducing the problem.
It is large, 21MB compressed, 163MB uncompressed, so I am linking it here:

https://drive.google.com/open?id=0Bzqrh1SO9FcEZkpPM0JwUEMzcmc

When running this against a default-configured 9.6.1 or 10devel-a73491e,
I get the error.

psql -U postgres -f <(xzcat combined_anon.sql.xz)

...

ERROR:  outer pathkeys do not match mergeclauses
STATEMENT:  explain update local1 set local19 = jj_join.local19 from
jj_join where local1.local12=jj_join.local12  and local1.local12 in
('ddd','bbb','aaa','abc');

Cheers,

Jeff


[HACKERS] Proposal : Parallel Merge Join

2016-12-10 Thread Dilip Kumar
I would like to propose a patch for parallelizing the merge join path.
This idea is derived from analyzing TPCH results.

I have done this analysis along with my colleagues Rafia Sabih and Amit Kapila.

We already have the infrastructure for executing parallel joins, so we
don't need any change at the executor level. Only in path generation do we
need to add partial paths for merge join, as we already do for nestloop
and hash join.

After this patch, we will get some new type of paths as shown below.

-> Gather
   -> MergeJoin
      -> Sort
         -> Partial Seq Scan
      -> Any parallel safe inner path
or
-> Gather
   -> MergeJoin
      -> Partial Index Scan path (or any sorted partial path)
      -> Any parallel safe inner path


The major benefit of this patch is that it can help convert some plans
into parallel plans where parallelism was not used at all, maybe because a
merge join was needed at some intermediate level.


Performance:
--
- I applied this patch on head and tested TPCH (so far only with scale
factor 10) and observed that Q20 gives a 2x gain. On head, this query was
not using parallelism at all, but with parallel merge join it is able to
use parallelism all the way up to the top-level join (explain analyze
result is attached).

- I need to test this with different scale factors and also with other
patches like parallel index scan and gather merge, and I hope to see more
plans benefiting.

* 'gather merge' will be useful where we expect a sorted path from the
merge join, because a parallel merge join will not produce a sorted result.

Test configuration and machine detail:
--
TPCH: scale factor : 10
work_mem  : 100MB
shared buffer   : 1GB
max_parallel_workers_per_gather : 3
Machine   : POWER 4 socket machine.


Attached file:
--
1. parallel_mergejooin_v1.patch : parallel merge join patch
2. 20_head.out   : Explain analyze output on head (median of 3 runs)
3. 20_patch.out  : Explain analyze output with patch (median of 3 runs)


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


20_head.out
Description: Binary data


20_patch.out
Description: Binary data


parallel_mergejoin_v1.patch
Description: Binary data



Re: [HACKERS] Parallel bitmap heap scan

2016-12-10 Thread Amit Kapila
On Wed, Nov 30, 2016 at 11:08 AM, Dilip Kumar  wrote:
>>
>> patch details
>> 1. hash_support_alloc_free_v1.patch [1].
>> 2. parallel-bitmap-heap-scan-v3.patch

Few assorted comments:

1.
+ else if (needWait)
+ {
+ /* Add ourself to wait queue */
+ ConditionVariablePrepareToSleep(>cv);
+ queuedSelf = true;
+ }

With the committed version of condition variables, you can avoid
calling ConditionVariablePrepareToSleep().  Refer to the latest parallel
index scan patch [1].
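
(Roughly, and with a hypothetical predicate and wait-event name, the wait
loop can be reduced to:)

    /* ConditionVariableSleep() does the prepare step itself on first use */
    while (!leader_has_built_tbm(pbminfo))
        ConditionVariableSleep(&pbminfo->cv, WAIT_EVENT_PARALLEL_BITMAP_SCAN);
    ConditionVariableCancelSleep();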

2.
+ ParallelBitmapScan
+ Waiting for leader to complete the TidBitmap.
+
+

Isn't it better to write it as below:

Waiting for the leader backend to form the TidBitmap.

3.
+ * Update snpashot info in heap scan descriptor.

/snpashot/snapshot

4.
 #include "utils/tqual.h"
-
+#include "miscadmin.h"

 static void bitgetpage(HeapScanDesc scan, TBMIterateResult *tbmres);
-
+static bool pbms_is_leader(ParallelBitmapInfo *pbminfo);

  TBMIterateResult *tbmres;
-
+ ParallelBitmapInfo *pbminfo = node->parallel_bitmap;

 static int tbm_comparator(const void *left, const void *right);
-
+void * tbm_alloc_shared(Size size, void *arg);

It seems the line deletions at the above places are spurious.  Please check
for similar occurrences at other places in the patch.

5.
+ bool is_shared; /* need to build shared tbm if set*/

space is required towards the end of the comment (set */).

6.
+ /*
+ * If we have shared TBM means we are running in parallel mode.
+ * So directly convert dsa array to page and chunk array.
+ */

I think the above comment can be simplified as "For shared TBM,
directly convert dsa array to page and chunk array"

7.
+ dsa_entry = (void*)(((char*)dsa_entry) + sizeof(dsa_pointer));

extra space is missing at multiple places in above line. It should be
written as below:

dsa_entry = (void *)(((char *) dsa_entry) + sizeof(dsa_pointer));

Similar stuff needs to be taken care of at other places in the patch as
well.  I think it will be better if you run pgindent on your patch.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Quorum commit for multiple synchronous replication.

2016-12-10 Thread Amit Kapila
On Thu, Dec 8, 2016 at 3:02 PM, Masahiko Sawada  wrote:
> On Thu, Dec 8, 2016 at 4:39 PM, Michael Paquier
>  wrote:
>> On Thu, Dec 8, 2016 at 9:07 AM, Robert Haas  wrote:
>>> You could do that, but first I would code up the simplest, cleanest
>>> algorithm you can think of and see if it even shows up in a 'perf'
>>> profile.  Microbenchmarking is probably overkill here unless a problem
>>> is visible on macrobenchmarks.
>>
>> This is what I would go for! The current code is doing a simple thing:
>> select the Nth element using qsort() after scanning each WAL sender's
>> values. And I think that Sawada-san got it right. Even running on my
>> laptop a pgbench run with 10 sync standbys using a data set that fits
>> into memory, SyncRepGetOldestSyncRecPtr gets at most 0.04% of overhead
>> using perf top on a non-assert, non-debug build. Hash tables and
>> allocations get a far larger share. Using the patch,
>> SyncRepGetSyncRecPtr is at the same level with a quorum set of 10
>> nodes. Let's kick the ball for now. An extra patch could make things
>> better later on if that's worth it.
>
> Yeah, since both K and N may not be large, these algorithms take
> almost the same time, and the current patch does the simple thing. When we
> need over 100 or 1000 replication nodes, the optimization could be
> required.
> Attached is the latest v9 patch.
>

Few comments:

+ * In 10.0 we support two synchronization methods, priority and
+ * quorum. The number of synchronous standbys that transactions
+ * must wait for replies from and synchronization method are specified
+ * in synchronous_standby_names. This parameter also specifies a list
+ * of standby names, which determines the priority of each standby for
+ * being chosen as a synchronous standby. In priority method, the standbys
+ * whose names appear earlier in the list are given higher priority
+ * and will be considered as synchronous. Other standby servers appearing
+ * later in this list represent potential synchronous standbys. If any of
+ * the current synchronous standbys disconnects for whatever reason,
+ * it will be replaced immediately with the next-highest-priority standby.
+ * In quorum method, the all standbys appearing in the list are
+ * considered as a candidate for quorum commit.

In the above description, is the priority method represented by FIRST and
the quorum method by ANY in the synchronous_standby_names syntax?  If so,
it might be better to state that explicitly.
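
(For concreteness, the two syntaxes under discussion look like this; the
standby names are placeholders:)

synchronous_standby_names = 'FIRST 2 (s1, s2, s3)'   # priority method
synchronous_standby_names = 'ANY 2 (s1, s2, s3)'     # quorum method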


2.
 --- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -2637,16 +2637,8 @@ mergeruns(Tuplesortstate *state)
  }

  /*
- * Allocate a new 'memtuples' array, for the heap.  It will hold one tuple
- * from each input tape.
- */
- state->memtupsize = numInputTapes;
- state->memtuples = (SortTuple *) palloc(numInputTapes * sizeof(SortTuple));
- USEMEM(state, GetMemoryChunkSpace(state->memtuples));
-
- /*
- * Use all the remaining memory we have available for read buffers among
- * the input tapes.
+ * Use all the spare memory we have available for read buffers among the
+ * input tapes.

This doesn't belong to this patch.

3.
+ * Return the list of sync standbys using according to synchronous method,

In the above sentence, "using according" seems either incomplete or incorrect.

4.
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ "invalid synchronization method is specified \"%d\"",
+ SyncRepConfig->sync_method));

Here, the error message doesn't seem to be aligned, and you might want to
use errmsg for it.

5.
+ * In priority method, we need the oldest these positions among sync
+ * standbys. In quorum method, we need the newest these positions
+ * specified by SyncRepConfig->num_sync.

/oldest these/oldest of these
/newest these positions specified/newest of these positions as specified

Instead of newest, can we consider to use latest?

6.
+ int sync_method; /* synchronization method */
  /* member_names contains nmembers consecutive nul-terminated C strings */
  char member_names[FLEXIBLE_ARRAY_MEMBER];
 } SyncRepConfigData;

Can't we use 1 or 2 bytes to store sync_method information?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

