Re: [HACKERS] [bug fix] pg_ctl always uses the same event source

2014-07-19 Thread Amit Kapila
On Fri, Jul 18, 2014 at 7:08 PM, MauMau  wrote:
>
> From: "Magnus Hagander" 
>
>> On Fri, Jul 18, 2014 at 5:33 AM, Amit Kapila 
wrote:
>>>
>>> On Thu, Jul 17, 2014 at 4:51 PM, Magnus Hagander 
>>> wrote:


 Did anyone actually test this patch? :)

 I admit I did not build it on Windows specifically because I assumed
 that was done as part of the development and review. And the changes
 to pg_event.c can never have built, since the file does not include
 the required header.
>>>
>>>
>>> I have tested it on Windows and infact on Linux as well to see if there
>>> is any side impact before marking it as Ready For Committer.
>>>
>>> It seems to me that the required header is removed in last version
>>> (pg_ctl_eventsrc_v11) where MessageBox() related changes have been
>>> removed from patch as per recent discussion.  Sorry for not being able
>>> to check last version posted.
>>
>>
>> Gotcha. Thanks for clarifying, and I apologize if I came across a bit
>> harsh even with the smiley.

The statement was not at all harsh.  I think you are right to ask that
question, and I believe that is the minimum expectation once a patch
reaches the Ready for Committer stage.

>
> I'm sorry to have caused both of you trouble.  I have to admit that I
> didn't compile the source when I removed the MessageBox()-related changes.
> The attached patch fixes that.  I confirmed successful build this time.

I have tested the attached patch on Windows; it works fine for both the
default and non-default cases.  It also passes the other regression tests,
and the build went fine on Linux.

One more thing about the inclusion of the new header file in pgevent.c: I
think that is okay, because we include it in other client-side modules in
the bin directory, such as reindex.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


[HACKERS] SQL MERGE is quite distinct from UPSERT

2014-07-19 Thread Peter Geoghegan
Someone asked me privately why we weren't immediately pursuing SQL
MERGE, as opposed to yet another non-standard variant of UPSERT
(UPSERT is loosely defined here as an insert-or-update DML command
that falls back to an update on a would-be unique constraint
violation, and does one or the other *atomically*). I believe that at
least one other system (Teradata) that now has both formerly offered
only the latter. I think it makes sense to have both. While I believe
it's already generally accepted that SQL MERGE is distinct from UPSERT
in fairly fundamental ways, in that it solves a somewhat distinct set
of problems, I would like to put my thoughts on the matter on record.
This is mostly for the sake of the archives.

The first reason the two are distinct is that MERGE, as implemented
within other major systems and as described by the SQL standard, does
not require an atomic insert-or-update even for the simplest possible
case of OLTP-style insert-or-update. As I went into during my talk at
pgCon, even these simple cases may throw errors on other systems' MERGE
implementations. This leeway to fail may be valuable for some use
cases, though, such as bulk loading. Simon Riggs has characterized
MERGE as something that exists for the benefit of those two fairly
distinct use cases [1], and I agree with him. I suspect that the
predictability of MERGE's behavior suffers due to trying to support
both at once, and I think that MERGE should be specifically promoted as
a bulk loading command if we ever get around to adding it to
PostgreSQL.
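To make the distinction concrete, here is a sketch of a standard-style
MERGE covering the bulk-loading case: folding a staging table into a
target in a single statement. The syntax follows the SQL standard as
implemented by other systems (exact details vary between them), and the
table and column names are purely illustrative:

  MERGE INTO inventory i
  USING staging_inventory s
      ON (i.sku = s.sku)
  WHEN MATCHED THEN
      UPDATE SET quantity = i.quantity + s.quantity
  WHEN NOT MATCHED THEN
      INSERT (sku, quantity) VALUES (s.sku, s.quantity);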

The second reason they differ is that SQL MERGE, as implemented within
other systems, doesn't require that a unique index be defined on the
column or columns of the MERGE statement's outer join. I am not
enthusiastic about the idea of falling back to table-level locking
when an appropriate index isn't available. That could be a foot-gun
for the large majority of users who just want to solve their basic
insert-or-update problem. I did think to mention this when asked about
my work on UPSERT, but the next question was: "So why not just throw
an error when it's not possible to use an available unique index?"
That is a fair point, but it doesn't address all the problems with
trying to make MERGE perform upserts that meet my requirements for
UPSERT [2] (principally the requirement of atomicity, in the sense of
always getting an insert or update).

These two points, particularly the first, are likely to weigh upon the
implementation of both, which is why we cannot very well just insist
upon having appropriate unique indexes defined. It buys my UPSERT
implementation plenty to have the upsert driven by an insert into a
unique index. With UPSERT, having an insert occur is always an outcome
that the DML statement's author will consider acceptable. SQL MERGE
statements can be written without any WHEN NOT MATCHED THEN INSERT;
SQL MERGE statement authors are not necessarily going to expect an
insert at all.
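For example (again standard-style syntax with illustrative names), a
perfectly reasonable MERGE statement can omit the insert arm entirely,
so that unmatched source rows are simply ignored and no insert can ever
occur:

  MERGE INTO account a
  USING payment p
      ON (a.account_id = p.account_id)
  WHEN MATCHED THEN
      UPDATE SET balance = a.balance + p.amount;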

As I outlined in my pgCon talk, there appears to be a fundamental
trade-off involved when implementing UPSERT. My implementation is
prepared for the possibility that upon finding a duplicate, and upon
subsequently attempting to lock a row, the row may be found to have
been deleted. When that happens, we try again without having made any
useful progress in the first iteration. The ideal of a strict queue of
upserters, with totally fair row-locking-style lock arbitration,
appears to be basically impossible, at least provided you're unwilling
to give up insert-or-update atomicity (that is, to get this you have
to be willing to give up with an error even at READ COMMITTED, which
is a cop-out, or buy into using something like table-level locks,
which is also a cop-out). Theoretical lock starvation hazards exist in
my implementation because I don't do this, but all indications are
that this is fine in the real world. In this regard my implementation
is similar to the current looping upsert subtransaction pattern we've
been recommending for years [3]. We don't hold on to "value locks" on
values within indexes (except to finish staggered index-locks-first
insertion - but not to update/row lock).
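For reference, the retry-loop pattern referenced in [3] is roughly the
following sketch, modeled on the example in the PostgreSQL
documentation (the db table and merge_db function are illustrative):

  CREATE FUNCTION merge_db(key INT, data TEXT) RETURNS VOID AS
  $$
  BEGIN
      LOOP
          -- First, try to update an existing row.
          UPDATE db SET b = data WHERE a = key;
          IF found THEN
              RETURN;
          END IF;
          -- Not there, so try to insert it.  A concurrent insert of the
          -- same key raises a unique_violation, which we trap and then
          -- simply loop back to retry the UPDATE.
          BEGIN
              INSERT INTO db (a, b) VALUES (key, data);
              RETURN;
          EXCEPTION WHEN unique_violation THEN
              -- Do nothing; loop and try the UPDATE again.
          END;
      END LOOP;
  END;
  $$ LANGUAGE plpgsql;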

This is all fine in the real world because *somebody* has to make
progress when there is highly contended upserting. Some session has to
insert-or-update successfully - otherwise you cannot have a conflict
to begin with. Livelocking is therefore not possible, and you need a
*sustained* perfect storm of conflicts starving out one session in
order to see lock starvation at all. SQL MERGE has an outer join on a
table of proposed values (as opposed to an inner join against already
rejected values). MERGE allows you to do something other than insert
WHEN NOT MATCHED (i.e. to do nothing), and I think that being able to
always insert at that juncture gives us a way to usefully terminate
the loop where there might not be another way.

Suppose you had a similar "keep retrying" implementation (similar to
what I've proposed for 

Re: [HACKERS] Making joins involving ctid work for the benefit of UPSERT

2014-07-19 Thread Peter Geoghegan
On Fri, Jul 18, 2014 at 11:23 AM, Andres Freund  wrote:
> Meh. A understandable syntax wouldn't require the pullups with a special
> scan node and such.

Well, in general ExecModifyTable()/ExecUpdate() trusts the tid passed
to match the qual in the query. Unless you're willing to give up on
the idea of having a qual on the update (other than something like an
ON CONFLICTS qual, obviously) I think you'll probably end up with some
kind of pseudo-scan node anyway, if only for consistency with how
things already work, to give you somewhere to show the Filter in
explain output and so on. ExecScan() probably needs to ExecQual().
Inserts don't have scan nodes, but updates do, and so it seems pretty
odd to make the "Insert" node (a ModifyTable node) pull up from a
result node (as always), and the "Update" node (also a ModifyTable
node) treat the first ModifyTable node as its own pseudo-scan node,
when in fact we are scanning the heap in a manner not unlike a
conventional update. Or do you envision a second result node where an
update qual might be evaluated? I am not enthused about either
possibility.

The idea of not having the ability to put a predicate on the update at
all, much like the MySQL thing, isn't completely outrageous, but it
certainly isn't going to go down well with those who have already
expressed concerns about how much of a foot-gun this could be.
-- 
Peter Geoghegan




Re: [HACKERS] Proposal for updating src/timezone

2014-07-19 Thread Gavin Flower

On 20/07/14 06:30, John Cochran wrote:

On Sat, Jul 19, 2014 at 11:58 AM, Michael Banck wrote:

SNIP

Maybe if you pgindent the IANA code as well, you can more easily diff
the actual changes between the two, did you try that?


Michael


Unfortunately, pgindent doesn't work well with the IANA code, as
evidenced by some previous checkins.

To quote a checkin made 2004-05-21 16:59:10 by Tom Lane:
pgindent did a pretty awful job on the timezone code, particularly
with respect to doubly-starred comment blocks.  Do some manual cleanup.



[...]

For simply noting functional differences: could one not make copies of
the source files for both the pg & IANA code, and remove ALL indentation
prior to a diff?



Cheers,
Gavin


Re: [HACKERS] tweaking NTUP_PER_BUCKET

2014-07-19 Thread Tomas Vondra
On 19.7.2014 23:07, Tomas Vondra wrote:
> On 19.7.2014 20:28, Tomas Vondra wrote:
> For the first case, a WARNING at the end of estimate_hash_bucketsize
> says this:
> 
> WARNING:  nbuckets=8388608.00 estfract=0.01
> WARNING:  nbuckets=65536.00 estfract=0.000267
> 
> There are 4.3M rows in the big_table, so it says ~4 tuples (selectivity
>> = 1e-6), and ~10 tuples/bucket for the small_table (42k rows).
> 
> For the second case, I get this:
> 
> WARNING:  nbuckets=16777216.00 estfract=0.01
> WARNING:  nbuckets=262144.00 estfract=0.000100
> 
> i.e. ~11 tuples/bucket for big_table and ~20 tuples for small_table.
> 
> That sounds really strange, because small_table in the second test case
> is almost perfectly unique. And in the first test case, there are only
> very few keys with >2 occurences.

FWIW, the "problem" seems to be this:

  /*
   * Adjust estimated bucketsize upward to account for skewed
   * distribution.
   */
  if (avgfreq > 0.0 && mcvfreq > avgfreq)
      estfract *= mcvfreq / avgfreq;

Which is explained in the estimate_hash_bucketsize comment like this:

 * be nonuniformly occupied.  If the other relation in the join has a key
 * distribution similar to this one's, then the most-loaded buckets are
 * exactly those that will be probed most often.  Therefore, the "average"
 * bucket size for costing purposes should really be taken as something
 * close to the "worst case" bucket size.  We try to estimate this by
 * adjusting the fraction if there are too few distinct data values, and
 * then scaling up by the ratio of the most common value's frequency to
 * the average frequency.

The problem is that the two test cases don't behave like this, i.e. the
other relation does not have a similar distribution (because the join
key there is unique). Actually, I wonder how frequently that happens,
and I wouldn't be surprised if it was quite rare.

So maybe we shouldn't scale it the way we scale it now. For example, we
could do this:

  if (avgfreq > 0.0 && mcvfreq > avgfreq)
      estfract *= sqrt(mcvfreq / avgfreq);

Or, instead of using the first MCV frequency (i.e. the frequency of the
most common value, which causes the unexpectedly high tuples/bucket
values), use an average or median of the MCV frequencies.

Any of these three changes "fixed" the first test case, and some fixed
the second one. However, it's pure guesswork, and I have no idea how
many plans will be hit negatively by these changes.

What I think might work better is lowering the cost of searching small
hash tables when the hash table fits into the L1/L2/... caches. The
table I posted in the first message of this thread illustrates that:

  kB      NTUP=1  NTUP=2  NTUP=4  NTUP=5  NTUP=9  NTUP=10
  1407    6753    7218    7890    9324    9488    12574
  7032    9417    11586   17417   15820   25732   25402
  35157   13179   17101   27225   24763   43323   43175
  70313   14203   18475   29354   26690   46840   46676
  175782  14319   17458   25325   34542   37342   60949
  351563  16297   19928   29124   43364   43232   70014
  703125  19027   23125   33610   50283   50165   81398

So a small hash table (~1.4MB), fitting into L2 (measured on a CPU with
a ~4MB cache), is influenced much less than the large tables are. I
don't know whether we should detect the CPU cache size somehow, or
whether we should assume some conservative size (say, 4MB sounds OK),
or what. But I think something like this might work:

  if (hash_size < 4 * 1024 * 1024) {          /* < 4MB */
      /* take in only 10% of the difference */
      estfract += (mcvfreq / avgfreq) * estfract * 0.1;
  } else if (hash_size > 16 * 1024 * 1024) {  /* > 16MB */
      estfract *= (mcvfreq / avgfreq);
  } else {
      /* linear approximation between 10 and 100% */
      estfract += (mcvfreq / avgfreq) * estfract *
          (0.1 + 0.9 * (hash_size - 4 * 1024 * 1024) / (12.0 * 1024 * 1024));
  }

Or maybe not; I'm not really sure. The problem is that the two test
cases really don't match the assumption that the outer relation has the
same distribution.

regards
Tomas




Re: [HACKERS] tweaking NTUP_PER_BUCKET

2014-07-19 Thread Tomas Vondra
On 19.7.2014 20:28, Tomas Vondra wrote:
> On 19.7.2014 20:24, Tom Lane wrote:
>> Tomas Vondra  writes:
>>> I've reviewed the two test cases mentioned here, and sadly there's
>>> nothing that can be 'fixed' by this patch. The problem here lies in the
>>> planning stage, which decides to hash the large table - we can't fix
>>> that in the executor.
>>
>> We've heard a couple reports before of the planner deciding to hash a
>> larger table rather than a smaller one.  The only reason I can think of
>> for that is if the smaller table has many more duplicates, so that the
>> planner thinks the executor might end up traversing long hash chains.
>> The planner's estimates could easily be off in this area, of course.
>> estimate_hash_bucketsize() is the likely culprit if it's wrong.
>>
>> Which test case are you seeing this in, exactly?
> 
> The two reported by Stephen here:
> 
> 
> http://www.postgresql.org/message-id/20130328235627.gv4...@tamriel.snowman.net
> 
> Just download this (I had to gunzip it):
> 
>   http://snowman.net/~sfrost/test_case.sql
>   http://snowman.net/~sfrost/test_case2.sql

Anyway, looking a bit at the distributions, I don't think they are
significantly skewed. In the first test case, I get this:

test=# select cnt, count(*) from (select id_short, count(*) AS cnt from
small_table group by 1) foo group by cnt order by cnt;
 cnt | count
-----+-------
   1 |    23
   2 | 18795
   3 |    13
   4 |   726
   5 |     4
   6 |    93
   8 |     4
  10 |     1
(8 rows)

and in the second one I get this:

test=# select cnt, count(*) from (select id_short, count(*) AS cnt from
small_table group by 1) foo group by cnt order by cnt;
 cnt | count
-----+--------
   1 | 182161
   2 |   5582
   3 |    371
   4 |     28
   5 |      2
   6 |      3
   9 |      1
(7 rows)

So while #1 contains most values twice, #2 is almost perfectly distinct.
In the big_table, the values are unique.

For the first case, a WARNING at the end of estimate_hash_bucketsize
says this:

WARNING:  nbuckets=8388608.00 estfract=0.01
WARNING:  nbuckets=65536.00 estfract=0.000267

There are 4.3M rows in the big_table, so that's ~4 tuples/bucket
(selectivity >= 1e-6), and ~10 tuples/bucket for the small_table (42k rows).

For the second case, I get this:

WARNING:  nbuckets=16777216.00 estfract=0.01
WARNING:  nbuckets=262144.00 estfract=0.000100

i.e. ~11 tuples/bucket for big_table and ~20 tuples for small_table.

That sounds really strange, because small_table in the second test case
is almost perfectly unique. And in the first test case, there are only
very few keys with >2 occurrences.


By explicitly setting the selectivity in estimate_hash_bucketsize to
1e-6, I got these plans:

Test case #1

  QUERY PLAN
-
 Hash Join  (cost=1229.46..79288.95 rows=41176 width=24) (actual
time=50.397..634.986 rows=13731 loops=1)
   Hash Cond: (big_table.id_short = small_table.id_short)
   ->  Seq Scan on big_table  (cost=0.00..61626.71 rows=4272271 width=4)
(actual time=0.018..229.597 rows=4272271 loops=1)
   ->  Hash  (cost=714.76..714.76 rows=41176 width=24) (actual
time=31.653..31.653 rows=41176 loops=1)
 Buckets: 65536  Batches: 1  Memory Usage: 3086kB
 ->  Seq Scan on small_table  (cost=0.00..714.76 rows=41176
width=24) (actual time=0.012..13.341 rows=41176 loops=1)
 Planning time: 0.597 ms
 Execution time: 635.498 ms
(8 rows)

Without explain analyze: 486 ms (original plan: >850ms).

Test case #2

  QUERY PLAN
-
 Hash Join  (cost=5240.21..220838.44 rows=194587 width=4) (actual
time=88.252..2143.457 rows=149540 loops=1)
   Hash Cond: (big_table.id_short = small_table.id_short)
   ->  Seq Scan on big_table  (cost=0.00..169569.72 rows=11755372
width=4) (actual time=0.018..533.955 rows=11755475 loops=1)
   ->  Hash  (cost=2807.87..2807.87 rows=194587 width=4) (actual
time=87.548..87.548 rows=194587 loops=1)
 Buckets: 262144  Batches: 1  Memory Usage: 8889kB
 ->  Seq Scan on small_table  (cost=0.00..2807.87 rows=194587
width=4) (actual time=0.012..29.929 rows=194587 loops=1)
 Planning time: 0.438 ms
 Execution time: 2146.818 ms
(8 rows)

Without explain analyze: 1600 ms (original plan: >2300ms)

The differences are fairly small - well, it's 30-40% faster, but it's
less than 1s in both cases either way. I'm wondering whether we could
run into similar problems with much longer queries, and whether we'd
still see a 30-40% improvement there.

Tomas





Re: [HACKERS] Proposal for updating src/timezone

2014-07-19 Thread John Cochran
On Sat, Jul 19, 2014 at 11:58 AM, Michael Banck  wrote:
SNIP
>
> Maybe if you pgindent the IANA code as well, you can more easily diff
> the actual changes between the two, did you try that?
>
>
> Michael
>

Unfortunately, pgindent doesn't work well with the IANA code, as evidenced
by some previous checkins.

To quote a checkin made 2004-05-21 16:59:10 by Tom Lane:
pgindent did a pretty awful job on the timezone code, particularly with
respect to doubly-starred comment blocks.  Do some manual cleanup.

As it is, I've finished checking the differences between the postgres and
IANA code for zic.c, after editing both to eliminate non-functional style
differences such as indentation, function prototypes, comparing strchr
results against NULL or 0, etc. It looks like the only differences from the
stock code are support for the -P option implemented by Tom Lane, the
substitution of the postgres getopt code for the unix default, and a few
minor changes in some defaults plus minor modifications to deal with
Windows. Overall, rather small and trivial.

Additionally, I discovered a little piece of low-hanging fruit. The
Makefile for the timezone code links together zic.o, ialloc.o, scheck.o,
and localtime.o in order to make zic.  Additionally, zic.c has a stub
implementation of pg_open_tzfile() in order to resolve a linkage issue with
the localtime.o module for zic.
Well, as it turns out, localtime.o doesn't supply ANY symbols that zic
needs and can therefore be omitted entirely from the list of object files
comprising zic. That, in turn, means the stub implementation of
pg_open_tzfile() can be removed from the postgres version of zic.c.  I'm
not bothering to submit a patch for this, since such a patch would be quite
short-lived given my objective of bringing the code up to date with the
2014e version of the IANA code. But I am submitting a bug report to IANA
about the Makefile, since the unneeded linkage with localtime.o is still
present in the 2014e code on their site.

-- 

There are two ways of constructing a software design: One way is to make it
so simple that there are obviously no deficiencies and the other way is to
make it so complicated that there are no obvious deficiencies. — C.A.R.
Hoare


Re: [HACKERS] tweaking NTUP_PER_BUCKET

2014-07-19 Thread Tomas Vondra
On 19.7.2014 20:24, Tom Lane wrote:
> Tomas Vondra  writes:
>> I've reviewed the two test cases mentioned here, and sadly there's
>> nothing that can be 'fixed' by this patch. The problem here lies in the
>> planning stage, which decides to hash the large table - we can't fix
>> that in the executor.
> 
> We've heard a couple reports before of the planner deciding to hash a
> larger table rather than a smaller one.  The only reason I can think of
> for that is if the smaller table has many more duplicates, so that the
> planner thinks the executor might end up traversing long hash chains.
> The planner's estimates could easily be off in this area, of course.
> estimate_hash_bucketsize() is the likely culprit if it's wrong.
> 
> Which test case are you seeing this in, exactly?

The two reported by Stephen here:


http://www.postgresql.org/message-id/20130328235627.gv4...@tamriel.snowman.net

Just download this (I had to gunzip it):

  http://snowman.net/~sfrost/test_case.sql
  http://snowman.net/~sfrost/test_case2.sql

regards
Tomas




Re: [HACKERS] tweaking NTUP_PER_BUCKET

2014-07-19 Thread Tomas Vondra
On 13.7.2014 21:32, Tomas Vondra wrote:
> The current patch only implemnents this for tuples in the main hash 
> table, not for skew buckets. I plan to do that, but it will require 
> separate chunks for each skew bucket (so we can remove it without 
> messing with all of them). The chunks for skew buckets should be
> smaller (I'm thinking about ~4kB), because there'll be more of them,
> but OTOH those buckets are for frequent values so the chunks should
> not remain empty.

I've looked into extending the dense allocation to the skew buckets, and
I think we shouldn't do that. I got through about 50% of the changes and
then threw them out, because it turned out to be quite pointless.

The amount of memory for skew buckets is limited to 2% of work_mem, so
even with 100% overhead it'll use ~4% of work_mem. There's pretty much
nothing to gain here, so the additional complexity introduced by the
dense allocation doesn't seem worth it.

I'm not entirely happy with the code allocating some memory densely and
some using traditional palloc, but it certainly seems cleaner than the
code I had.

So I think the patch is mostly OK as is.

regards
Tomas




Re: [HACKERS] tweaking NTUP_PER_BUCKET

2014-07-19 Thread Tom Lane
Tomas Vondra  writes:
> I've reviewed the two test cases mentioned here, and sadly there's
> nothing that can be 'fixed' by this patch. The problem here lies in the
> planning stage, which decides to hash the large table - we can't fix
> that in the executor.

We've heard a couple reports before of the planner deciding to hash a
larger table rather than a smaller one.  The only reason I can think of
for that is if the smaller table has many more duplicates, so that the
planner thinks the executor might end up traversing long hash chains.
The planner's estimates could easily be off in this area, of course.
estimate_hash_bucketsize() is the likely culprit if it's wrong.

Which test case are you seeing this in, exactly?

regards, tom lane




Re: [HACKERS] tweaking NTUP_PER_BUCKET

2014-07-19 Thread Tomas Vondra
On 14.7.2014 06:29, Stephen Frost wrote:
> Tomas,
> 
> * Tomas Vondra (t...@fuzzy.cz) wrote:
>> On 6.7.2014 17:57, Stephen Frost wrote:
>>> * Tomas Vondra (t...@fuzzy.cz) wrote:
 I can't find the thread / test cases in the archives. I've found this
 thread in hackers:

 http://www.postgresql.org/message-id/caoezvif-r-ilf966weipk5by-khzvloqpwqurpak3p5fyw-...@mail.gmail.com

 Can you point me to the right one, please?
>>>
>>> This:
>>>
>>> http://www.postgresql.org/message-id/20130328235627.gv4...@tamriel.snowman.net
>>
>> Sadly, the link to the SQL does not work :-(
>>
>>   http://snowman.net/~sfrost/test_case.sql => 404
>>
>>> and I think there may have been other test cases on that thread
>>> somewhere...  I certainly recall having at least a couple of them that I
>>> was playing with.
>>>
>>> I can probably find them around here somewhere too, if necessary.
>>
>> If you could look for them, that'd be great.
> 
> cc'ing me directly helps me notice these...
> 
> I've added back test_case.sql and test_case2.sql.  Please check them out
> and let me know- they're real-world cases we've run into.

Hi Stephen,

I've reviewed the two test cases mentioned here, and sadly there's
nothing that can be 'fixed' by this patch. The problem here lies in the
planning stage, which decides to hash the large table - we can't fix
that in the executor.

I did three runs - with master, with the 'dense allocation' patch
applied, and with 'dense allocation + ntup patch' applied.

As the estimates are pretty exact here, the ntup patch effectively only
sets NTUP_PER_BUCKET=1 (and does not do any dynamic resizing).

For the first test case, it behaves like this:

### master (no patches) 

 QUERY PLAN
-
 Hash Join  (cost=115030.10..116671.32 rows=41176 width=24) (actual
time=980.363..1015.623 rows=13731 loops=1)
   Hash Cond: (small_table.id_short = big_table.id_short)
   ->  Seq Scan on small_table  (cost=0.00..714.76 rows=41176 width=24)
(actual time=0.015..2.563 rows=41176 loops=1)
   ->  Hash  (cost=61626.71..61626.71 rows=4272271 width=4) (actual
time=979.399..979.399 rows=4272271 loops=1)
 Buckets: 524288  Batches: 1  Memory Usage: 150198kB
 ->  Seq Scan on big_table  (cost=0.00..61626.71 rows=4272271
width=4) (actual time=0.026..281.151 rows=4272271 loops=1)
 Planning time: 0.196 ms
 Execution time: 1034.931 ms
(8 rows)

# without explain analyze

Time: 841,198 ms
Time: 825,000 ms

### master + dense allocation 

 QUERY PLAN
-
 Hash Join  (cost=115030.10..116671.32 rows=41176 width=24) (actual
time=778.116..815.254 rows=13731 loops=1)
   Hash Cond: (small_table.id_short = big_table.id_short)
   ->  Seq Scan on small_table  (cost=0.00..714.76 rows=41176 width=24)
(actual time=0.006..2.423 rows=41176 loops=1)
   ->  Hash  (cost=61626.71..61626.71 rows=4272271 width=4) (actual
time=775.803..775.803 rows=4272271 loops=1)
 Buckets: 524288  Batches: 1  Memory Usage: 150198kB
 ->  Seq Scan on big_table  (cost=0.00..61626.71 rows=4272271
width=4) (actual time=0.005..227.274 rows=4272271 loops=1)
 Planning time: 0.061 ms
 Execution time: 826.346 ms
(8 rows)

# without explain analyze

Time: 758,393 ms
Time: 734,329 ms

### master + dense allocation + ntup=1 patch 

 QUERY PLAN
-
 Hash Join  (cost=115030.10..116465.44 rows=41176 width=24) (actual
time=1020.288..1033.539 rows=13731 loops=1)
   Hash Cond: (small_table.id_short = big_table.id_short)
   ->  Seq Scan on small_table  (cost=0.00..714.76 rows=41176 width=24)
(actual time=0.015..2.214 rows=41176 loops=1)
   ->  Hash  (cost=61626.71..61626.71 rows=4272271 width=4) (actual
time=953.890..953.890 rows=4272271 loops=1)
 Buckets: 8388608  Batches: 1  Memory Usage: 215734kB
 ->  Seq Scan on big_table  (cost=0.00..61626.71 rows=4272271
width=4) (actual time=0.014..241.223 rows=4272271 loops=1)
 Planning time: 0.193 ms
 Execution time: 1055.351 ms
(8 rows)

# without explain analyze

Time: 836,050 ms
Time: 822,818 ms

#

For the second test case, the times are like this:

# master
Time: 2326,579 ms
Time: 2320,232 ms

# master + dense allocation
Time: 2207,254 ms
Time: 2251,691 ms

# master + dense allocation + ntup=1
Time: 2374,278 ms
Time: 2377,409 ms

So while the dense allocation shaves off ~5%, the time with both patches
applied is slightly higher (again, a ~5% difference). After spending some
time profiling the code, I believe it's because of a worse cache hit ratio
when accessing the buckets. By decreasing NTUP_PER_BUCKET, the second test
case now uses ~130MB of buckets (while originally it used only ~16MB).
That means slightly lower

Re: [HACKERS] Proposal for updating src/timezone

2014-07-19 Thread Michael Banck
Hi,

On Sat, Jul 19, 2014 at 09:28:25AM -0400, John Cochran wrote:
> Agreed. Right now, I'm seeing about updating zic.c to match the IANA code
> combined with the modifications that postgres did to it. So far, it doesn't
> look like many functional changes have been done, but due to the use of
> pgindent, there's a LOT of cosmetic changes that add one heck of a lot of
> noise in determining the actual differences. 

Maybe if you pgindent the IANA code as well, you can more easily diff
the actual changes between the two. Did you try that?


Michael




Re: [HACKERS] Proposal for updating src/timezone

2014-07-19 Thread John Cochran
Agreed. Right now, I'm looking at updating zic.c to match the IANA code,
combined with the modifications that postgres made to it. So far, it
doesn't look like many functional changes have been made, but due to the
use of pgindent, there are a LOT of cosmetic changes that add one heck of a
lot of noise when determining the actual differences. After I get the code
closely matching the IANA code, I'll submit a patch (unfortunately it will
pretty much be an entire replacement of zic.c, due to the massive changes
IANA made between 2010c and 2014e). I will request very strongly that
pgindent not be run again on the IANA code, so that future maintainers can
easily diff the IANA code against the postgres code to determine the actual
differences. I'll then see about doing the same with the other source files
in timezone.

On Fri, Jul 18, 2014 at 4:27 PM, Tom Lane  wrote:

> John Cochran  writes:
> > Did a diff between the 2010c version of localtime.c and the postgres
> > version and saw a lot more differences than what could have been expected
> > from simple reformatting and adaptation. So I installed gitk and took a
> > look at the change history of localtime.c
>
> > Well, looking at the results, it pretty much put paid to the concept of
> > ever importing the IANA code unaltered into the postgres tree. In fact,
> it
> > looks like the original version of localtime.c was based upon one of the
> > 2004a thru 2004c versions of the IANA code and since then has been
> > independently maintained.
>
> That's certainly true, but I'm not sure that we should give up all that
> easily on getting closer to the IANA code again.  For one thing, some of
> the thrashing had to do with the fact that we went over to 64-bit
> timestamps sooner than the IANA crew did.  I'm assuming that they insist
> on 64-bit arithmetic now; we do too, so for instance some of the typedef
> hacking might no longer be necessary.
>
> AFAICT from a quick scan of the git logs, the main thing we've done to the
> API that's not in IANA is invent pg_next_dst_boundary().  Perhaps that
> could be pushed into a separate source file.
>
> Even if we only got to a point where the sources were the same except for
> a few narrow patches, it would make tracking their updates much easier.
>
> regards, tom lane
>



-- 

There are two ways of constructing a software design: One way is to make it
so simple that there are obviously no deficiencies and the other way is to
make it so complicated that there are no obvious deficiencies. — C.A.R.
Hoare


Re: [HACKERS] Use unique index for longer pathkeys.

2014-07-19 Thread Amit Kapila
On Tue, Jul 15, 2014 at 2:17 PM, Kyotaro HORIGUCHI <horiguchi.kyot...@lab.ntt.co.jp> wrote:
>
> Hi, the attached is the revised version.

Thanks Horiguchi-San for the updated patch.

Today, while looking into the updated patch, I was wondering why we
can't eliminate useless keys from query_pathkeys when we actually build
them in standard_qp_callback(), somewhat similar to what we do in
standard_qp_callback->make_pathkeys_for_sortclauses->pathkey_is_redundant().

We already have index information related to base_rels before building
the query pathkeys.  I noticed that you mentioned the following in your
original mail, which indicates that the information related to
inheritance tables is built only after set_base_rel_sizes():

"These steps take place between set_base_rel_sizes() and
set_base_rel_pathlists() in make_one_rel(). The reason for the
position is that the step 2 above needs all inheritence tables to
be extended in PlannerInfo and set_base_rel_sizes (currently)
does that".

Now I am not sure why such information is not built during
build_simple_rel() in the code path below:
build_simple_rel()
{
..
    if (rte->inh)
    {
        ListCell   *l;

        foreach(l, root->append_rel_list)
        {
            AppendRelInfo *appinfo = (AppendRelInfo *) lfirst(l);

            /* append_rel_list contains all append rels; ignore others */
            if (appinfo->parent_relid != relid)
                continue;

            (void) build_simple_rel(root, appinfo->child_relid,
                                    RELOPT_OTHER_MEMBER_REL);
        }
    }
}

Could you please explain why the index information built in the above
path is not sufficient, or is there another case that I am missing?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com