[HACKERS] Bloom Filter lookup for hash joins

2013-06-26 Thread Atri Sharma
Hi All,

I have been researching bloom filters and discussed them on IRC with
RhodiumToad and David Fetter, and they pointed me to the various
places that could potentially use bloom filters, apart from the
places that already have them.

I have been reading the current implementation of hash joins, and in
ExecScanHashBucket, which I understand is the actual lookup function,
we could potentially keep a bloom filter per bucket. Instead of
actually looking up each hash value for the outer relation, we could
first check the corresponding bloom filter for that bucket, and only
if we get a positive do we look up the actual values, i.e. continue
with our current behaviour (since we could be looking at a false
positive).

This doesn't involve a lot of new logic. We need to add a bloom filter
in HashJoinState and set it when the hash table is being made. Also,
one thing we need to look at is the set of hash functions being used
for the bloom filters. This can be a point of further discussion.

The major potential benefit we could gain is that bloom filters never
give false negatives, so we could save a lot of lookups where the
corresponding bloom filter gives a negative.

This approach could also be particularly useful for hash anti joins,
where we look for negatives. Since bloom filters can easily be stored
and processed with straight bit operations, we could be looking at
significant savings here.
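To make the idea concrete, here is a minimal standalone sketch of the kind
of per-bucket filter I have in mind (plain C, deliberately not tied to the
actual HashJoinState/ExecScanHashBucket code; the filter size and the choice
of two derived hash functions are just placeholders for discussion):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* One Bloom filter per hash bucket: nbits bits, two hash functions. */
typedef struct BucketBloom
{
    uint8_t    *bits;
    uint32_t    nbits;
} BucketBloom;

static BucketBloom *
bloom_create(uint32_t nbits)
{
    BucketBloom *bf = malloc(sizeof(BucketBloom));

    bf->nbits = nbits;
    bf->bits = calloc((nbits + 7) / 8, 1);
    return bf;
}

/* Two cheap derived hashes; the real thing would reuse the join's hash value. */
static uint32_t
bloom_hash(uint32_t hashvalue, int which)
{
    uint32_t    h = hashvalue ^ (0x9e3779b9u * (uint32_t) (which + 1));

    h ^= h >> 16;
    h *= 0x85ebca6bu;
    h ^= h >> 13;
    return h;
}

/* Called while building the hash table, once per inner tuple. */
static void
bloom_insert(BucketBloom *bf, uint32_t hashvalue)
{
    for (int i = 0; i < 2; i++)
    {
        uint32_t    bit = bloom_hash(hashvalue, i) % bf->nbits;

        bf->bits[bit / 8] |= (uint8_t) (1 << (bit % 8));
    }
}

/*
 * Called in the probe phase: false means "definitely not in this bucket",
 * so the bucket scan can be skipped; true means "maybe", so we fall back
 * to the normal tuple-by-tuple comparison.
 */
static bool
bloom_maybe_present(const BucketBloom *bf, uint32_t hashvalue)
{
    for (int i = 0; i < 2; i++)
    {
        uint32_t    bit = bloom_hash(hashvalue, i) % bf->nbits;

        if ((bf->bits[bit / 8] & (1 << (bit % 8))) == 0)
            return false;
    }
    return true;
}

int
main(void)
{
    BucketBloom *bf = bloom_create(1024);

    bloom_insert(bf, 42);
    printf("42 maybe present: %d, 43 maybe present: %d\n",
           bloom_maybe_present(bf, 42), bloom_maybe_present(bf, 43));
    free(bf->bits);
    free(bf);
    return 0;
}

The point is that a negative answer costs a couple of bit tests instead of
walking the bucket's tuple chain, while a positive answer simply falls
through to the existing comparison logic.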

I am not sure if we already do something like this. Please direct me
to the relevant parts in the code if we already do.

Regards,

Atri

--
Regards,

Atri
l'apprenant




Re: [HACKERS] Add more regression tests for dbcommands

2013-06-26 Thread Robins Tharakan
Hi Andres,

Attached is a patch that does not CREATE DATABASE; as a result, the
regression tests no longer test the following:

- ALTER DATABASE RENAME TO is not allowed on a database in use. I had to
remove two tests that were using this.

- ALTER DATABASE SET TABLESPACE is also not allowed on a database in use, so
I removed that test too (the existing test was expected to fail anyway, but
at a later stage in the function, when checking whether the tablespace
exists; with the 'regression' database it never even reaches that part).

- The CREATE DATABASE test itself was checking whether 'CONNECTION LIMIT'
was working. Removed that as well.


Code coverage improved from 36% to 68%.

--
Robins Tharakan


On 24 June 2013 07:47, Andres Freund and...@2ndquadrant.com wrote:

 On 2013-05-21 02:58:25 +0530, Robins Tharakan wrote:
  Attached is an updated patch that does only 1 CREATE DATABASE and reuses
  that for all other tests.
  The code-coverage with this patch goes up from 36% to 70%.

 Even creating one database seems superfluous. The plain CREATE DATABASE
 has been tested due to the creation of the regression database itself
 anyway?
 All the other tests should be doable using the normal regression db?

 Greetings,

 Andres Freund

 --
  Andres Freund http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services



regress_db_v4.patch
Description: Binary data



Re: [HACKERS] Review: UNNEST (and other functions) WITH ORDINALITY

2013-06-26 Thread Dean Rasheed
On 26 June 2013 01:22, Josh Berkus j...@agliodbs.com wrote:
 Folks,

 (the below was already discussed on IRC)

 Leaving names aside on this patch, I'm wondering about a piece of
 functionality I have with the current unnest() and with the
 unnest_ordinality()[1] extension: namely, the ability to unnest several
 arrays in parallel by using unnest() in the target list.

 For example, given the table:

 lotsarrays (
 id serial PK,
 arr1 int[],
 arr2 numeric[],
 arr3 boolean[]
 )

 I can currently do:

 SELECT id,
 unnest(arr1) as arr1,
 unnest(arr2) as arr2,
 unnest(arr3) as arr3
 FROM lotsarrays;

 ... and if arr1, 2 and 3 are exactly the same length, this creates a
 coordinated dataset.   I can even use the unnest_ordinality() extension
 function to get the ordinality of this combined dataset:

 SELECT id,
 (unnest_ordinality(arr1)).element_number as array_index,
 unnest(arr1) as arr1,
 unnest(arr2) as arr2,
 unnest(arr3) as arr3
 FROM lotsarrays;

 There are reasons why this will be complicated to implement WITH
 ORDINALITY; DF, Andrew and I discussed them on IRC.  So allowing WITH
 ORDINALITY in the target list is a TODO, either for later in 9.4
 development, or for 9.5.

 So, this isn't stopping the patch; I just want a TODO for implement
 WITH ORDINALITY in the target list for SRFs.


So if I'm understanding correctly, your issue is that WITH ORDINALITY
is currently only accepted on SRFs in the FROM list, not that it isn't
working as expected in any way. I have no objection to adding a TODO
item to extend it, but note that the restriction is trivial to work
around:

CREATE TABLE lotsarrays
(
  id serial primary key,
  arr1 int[],
  arr2 numeric[],
  arr3 boolean[]
);

INSERT INTO lotsarrays(arr1, arr2, arr3) VALUES
  (ARRAY[1,2], ARRAY[1.1, 2.2], ARRAY[true, false]),
  (ARRAY[10,20,30], ARRAY[10.1, 20.2, 30.3], ARRAY[true, false, true]);

CREATE OR REPLACE FUNCTION unnest_ordinality(anyarray)
RETURNS TABLE(element_number bigint, element anyelement) AS
$$
  SELECT ord, elt FROM unnest($1) WITH ORDINALITY AS t(elt, ord)
$$
LANGUAGE sql STRICT IMMUTABLE;

SELECT id,
(unnest_ordinality(arr1)).element_number as array_index,
unnest(arr1) as arr1,
unnest(arr2) as arr2,
unnest(arr3) as arr3
FROM lotsarrays;
 id | array_index | arr1 | arr2 | arr3
----+-------------+------+------+------
  1 |           1 |    1 |  1.1 | t
  1 |           2 |    2 |  2.2 | f
  2 |           1 |   10 | 10.1 | t
  2 |           2 |   20 | 20.2 | f
  2 |           3 |   30 | 30.3 | t
(5 rows)

Personally I'm not a fan of SRFs in the select list, especially not
multiple SRFs there, since the results are hard to deal with if they
return differing numbers of rows. So I would tend to write this as a
LATERAL FULL join on the ordinality columns:

SELECT id,
   COALESCE(u1.ord, u2.ord, u3.ord) AS array_index,
   u1.arr1, u2.arr2, u3.arr3
  FROM lotsarrays,
   unnest(arr1) WITH ORDINALITY AS u1(arr1, ord)
  FULL JOIN unnest(arr2) WITH ORDINALITY AS u2(arr2, ord) ON u2.ord = u1.ord
  FULL JOIN unnest(arr3) WITH ORDINALITY AS u3(arr3, ord) ON u3.ord =
COALESCE(u1.ord, u2.ord);

Either way, I think the WITH ORDINALITY patch is working as expected.

Regards,
Dean




Re: [HACKERS] [PATCH] Fix conversion for Decimal arguments in plpython functions

2013-06-26 Thread Szymon Guz
On 26 June 2013 01:40, Steve Singer st...@ssinger.info wrote:

 On 06/25/2013 06:42 AM, Szymon Guz wrote:



 Hi,

 I've attached a new patch. I've fixed all the problems you've found,
 except for the efficiency problem, which has been described in previous
 email.

 thanks,
 Szymon


 This version of the patch addresses the issues I mentioned.  Thanks for
 looking into seeing if the performance issue is with our conversions to
 strings or inherit with the python decimal type.  I guess we (Postgresql)
 can't do much about it.   A runtime switch to use cdecimal if it is
 available is a good idea, but I agree with you that could be a different
 patch.

 One minor thing I noticed in this round,

  PLy_elog(ERROR, "could not import module 'decimal'");

 I think should have decimal in double-quotes.

 I think this patch is ready for a committer to look at it.

 Steve




Hi Steve,
thanks for the review.

I was thinking about speeding up the Decimal conversion using the module
you wrote about. What about trying to import it and, if that fails, falling
back to decimal.Decimal? There would be no warning in the logs, just a note
in the documentation that this module is used if it is available.
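For discussion, here is a rough standalone sketch of that fallback using the
plain CPython C API. It is not the attached patch; the function name and the
cached-constructor variable are purely illustrative:

#include <Python.h>

/* Cached constructor used to convert PostgreSQL numerics (illustrative only). */
static PyObject *decimal_ctor = NULL;

static int
init_decimal_ctor(void)
{
    PyObject   *module;

    /* Prefer the faster cdecimal module if it can be imported... */
    module = PyImport_ImportModule("cdecimal");
    if (module == NULL)
    {
        /* ...otherwise silently fall back to the standard decimal module. */
        PyErr_Clear();
        module = PyImport_ImportModule("decimal");
        if (module == NULL)
            return -1;          /* caller reports "could not import module" */
    }

    /* New reference to the Decimal constructor; the module itself can go. */
    decimal_ctor = PyObject_GetAttrString(module, "Decimal");
    Py_DECREF(module);

    return decimal_ctor != NULL ? 0 : -1;
}

With that, the documentation would only need the note that cdecimal is used
when it is importable.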

thanks,
Szymon


Re: [HACKERS] [9.4 CF 1] The Commitfest Slacker List

2013-06-26 Thread Robins
Hi,

Apologies for being unable to respond promptly. I've been traveling
(without much access) and this was the soonest I could settle down. I was
free for months, and then had to travel smack in the middle of the
commitfest.

Incidentally, I had reviewed one patch after your direct email, but as
someone mentioned earlier, actually pasting my name in as 'reviewer' seemed
awkward, so I didn't at the time (and currently it's set to David Fetter for
some reason).

I guess the point Tom mentioned earlier makes good sense, and I agree with
the spirit of the process. For my part, I will try to review more patches
and mark them as such on the commitfest page soon.
--
Robins Tharakan


On 23 June 2013 23:41, Josh Berkus j...@agliodbs.com wrote:

 Folks,

 For 9.2, we adopted it as policy that anyone submitting a patch to a
 commitfest is expected to review at least one patch submitted by someone
 else.  And that failure to do so would affect the attention your patches
 received in the future.  For that reason, I'm publishing the list below
 of people who submitted patches and have not yet claimed any patch in
 the commitfest to review.

 For those of you who are contributing patches for your company, please
 let your boss know that reviewing is part of contributing, and that if
 you don't do the one you may not be able to do the other.

 The two lists below, idle submitters and committers, constitutes 26
 people.  This *outnumbers* the list of contributors who are busy
 reviewing patches -- some of them four or more patches.  If each of
 these people took just *one* patch to review, it would almost entirely
 clear the list of patches which do not have a reviewer.  If these folks
 continue not to do reviews, this commitfest will drag on for at least 2
 weeks past its closure date.

 Andrew Geirth
 Nick White
 Peter Eisentrout
 Alexander Korotkov
 Etsuro Fujita
 Hari Babu
 Jameison Martin
 Jon Nelson
 Oleg Bartunov
 Chris Farmiloe
 Samrat Revagade
 Alexander Lakhin
 Mark Kirkwood
 Liming Hu
 Maciej Gajewski
 Josh Kuperschmidt
 Mark Wong
 Gurjeet Singh
 Robins Tharakan
 Tatsuo Ishii
 Karl O Pinc

 Additionally, the following committers are not listed as reviewers on
 any patch.  Note that I have no way to search which ones might be
 *committers* on a patch, so these folks are not necessarily slackers
 (although with Bruce, we know for sure):

 Bruce Momjian (momjian)
 Michael Meskes (meskes)
 Teodor Sigaev (teodor)
 Andrew Dunstan (adunstan)
 Magnus Hagander (mha)
 Itagaki Takahiro (itagaki)

 --
 Josh Berkus
 PostgreSQL Experts Inc.
 http://pgexperts.com





Re: [HACKERS] Add more regression tests for CREATE OPERATOR

2013-06-26 Thread Robins Tharakan
Hi Szymon,

The commented-out test that you're referring to is an existing test (not one
that I added or commented out). I was going to remove it, but interestingly
it tests a part of the code where (prima facie) it should fail, yet it
passes (probably why it was disabled in the first place)!


So technically I hope the regression patch I submitted can go through, since
this feedback isn't about that patch; but for my part I am quite intrigued
by this test (and how it passes), and I'll probably get back to this thread
about this particular commented-out test as time permits.

--
Robins Tharakan


On 25 June 2013 04:12, Robins Tharakan thara...@gmail.com wrote:

  Thanks a ton Szymon (for a reminder on this one).

 As a coincidental turn of events, I have had to travel half way across the
 world and am without my personal laptop (without a linux distro etc.) and
 just recovering from a jet-lag now.

 I'll try to install a VM on a make-shift laptop and get something going to
 respond as soon as is possible.

 Thanks
 --
 Robins Tharakan

 --
 Robins Tharakan


 On 17 June 2013 05:19, Szymon Guz mabew...@gmail.com wrote:

 On 23 May 2013 00:34, Robins Tharakan thara...@gmail.com wrote:

 Hi,

 Please find attached a patch to take code-coverage of CREATE OPERATOR
 (src/backend/commands/operatorcmds.c) from 56% to 91%.

 Any and all feedback is welcome.
 --
 Robins Tharakan




 Hi,
 there is one commented out test. I think it should be run, or deleted.
 There is no use of commented sql code which is not run.

 What do you think?

 regards,
 Szymon





Re: [HACKERS] Add more regression tests for CREATE OPERATOR

2013-06-26 Thread Szymon Guz
OK, so I think this patch can be committed; I will change the status.

thanks,
Szymon

On 26 June 2013 09:26, Robins Tharakan thara...@gmail.com wrote:

 Hi Szymon,

 The commented out test that you're referring to, is an existing test (not
 that I added or commented). I was going to remove but interestingly its
 testing a part of code where (prima-facie) it should fail, but it passes
 (probably why it was disabled in the first place)
 !


 So technically I hope this regression patch I submitted could go through
 since this feedback isn't towards that patch, but in my part I am quite
 intrigued about this test (and how it passes) and probably I'd get back on
 this thread about this particular commented out test in question, as time
 permits.

 --
 Robins Tharakan


 On 25 June 2013 04:12, Robins Tharakan thara...@gmail.com wrote:

  Thanks a ton Szymon (for a reminder on this one).

 As a coincidental turn of events, I have had to travel half way across
 the world and am without my personal laptop (without a linux distro etc.)
 and just recovering from a jet-lag now.

 I'll try to install a VM on a make-shift laptop and get something going
 to respond as soon as is possible.

 Thanks
 --
 Robins Tharakan

 --
 Robins Tharakan


 On 17 June 2013 05:19, Szymon Guz mabew...@gmail.com wrote:

 On 23 May 2013 00:34, Robins Tharakan thara...@gmail.com wrote:

 Hi,

 Please find attached a patch to take code-coverage of CREATE OPERATOR
 (src/backend/commands/operatorcmds.c) from 56% to 91%.

 Any and all feedback is welcome.
 --
 Robins Tharakan




 Hi,
 there is one commented out test. I think it should be run, or deleted.
 There is no use of commented sql code which is not run.

 What do you think?

 regards,
 Szymon






Re: [HACKERS] Review: Patch to compute Max LSN of Data Pages

2013-06-26 Thread Andres Freund
On 2013-06-26 08:50:27 +0530, Amit Kapila wrote:
 On Tuesday, June 25, 2013 11:12 PM Andres Freund wrote:
  On 2013-06-16 17:19:49 -0700, Josh Berkus wrote:
   Amit posted a new version of this patch on January 23rd.  But last
   comment on it by Tom is not sure everyone wants this.
  
   https://commitfest.postgresql.org/action/patch_view?id=905
  
   ... so, what's the status of this patch?
  
  That comment was referencing a mail of mine - so perhaps I better
  explain:
  
  I think the usecase for this utility isn't big enough to be included in
  postgres since it really can only help in a very limited
  circumstances. And I think it's too likely to be misused for stuff it's
  not useable for (e.g. remastering).
  
  The only scenario I see is that somebody deleted/corrupted
  pg_controldata. In that scenario the tool is supposed to be used to
  find
  the biggest lsn used so far so the user then can use pg_resetxlog to
  set
  that as the wal starting point.
  But that can be way much easier solved by just setting the LSN to
  something very, very high. The database cannot be used for anything
  reliable afterwards anyway.
 
 One of the main reason this was written was to make server up in case of
 corruption and 
 user can take dump of some useful information if any.
 
 By setting LSN very, very high user might loose the information which he
 wants to take dump.

Which information would that lose? We don't currently use the LSN for
tuple visibility. And you sure wouldn't do anything but dump such a
cluster.
Now you could argue that you could modify this to find the current xid
used - but that's not that easy due to the wraparound semantics of
xids. And you are more likely to be successful by looking at pg_clog.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Add visibility map information to pg_freespace.

2013-06-26 Thread Kyotaro HORIGUCHI
Hello, 

 I'm looking into this patch as a reviewer.

I appreciate your taking the time to review it.


I've had some suggestions so far:

 - I should be cautious about changing the existing interface.

  You're right; it had somehow slipped my mind. It might be better to
  provide a separate function rather than change the compatibility
  view, despite that being less pertinent to this extension. However,
  it is too small to be a standalone extension.

  On the other hand, a new column added at the end of the output could
  be considered harmless in most cases, given how this extension is
  used, I suppose.


 - A historical note is needed in the pg_freespace docs.

  Agreed. I'll provide documentation not only for pg_freespace, but
  also for the other modules I'll touch in this patch.


 - How about pageinspect?

  I proposed a simple representation format as a basis for discussion.
  Nevertheless, the VM pages have no more structure than a simple bit
  string, and given the VM info in pg_freespacemap I've come to doubt
  the necessity of vm_page_contents() for any reason other than
  orthogonality of this extension's interface (which I paid no
  attention to before :-).


 - How about pgstattuple?

  It could even be meaningful to add the number of not-all-visible
  pages, or their ratio to the total number of pages:

    | postgres=# select * from pgstattuple('t');
    | -[ RECORD 1 ]----------------+---------
    | table_len                    | 88711168
    | tuple_count                  | 61
    | tuple_len                    | 26400044
    | tuple_percent                | 29.76
    | dead_tuple_count             | 39
    | dead_tuple_len               | 17599956
    | dead_tuple_percent           | 19.84
    | free_space                   | 33607960
    | free_percent                 | 37.88
    + not_all_visible_page_percent | 23.54

# This column name looks too long, though.

  In addition, the discussion above about the stability of the
  interface is also applicable to this.


Any suggestions?

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] [PATCH] Fix conversion for Decimal arguments in plpython functions

2013-06-26 Thread Ronan Dunklau
The v2 patch does not work for me: regression tests for plpython fail on
the plpython_types test: every numeric is converted to None.

It seems the global decimal ctor is not initialized.

Please find two patches, to be applied on top of the v2 patch: one
initializes the decimal ctor, the other uses cdecimal when possible.

Using the performance test from Steve, on my machine:

- with cdecimal installed: ~84ms
- without cdecimal installed (standard decimal module): ~511ms


2013/6/26 Szymon Guz mabew...@gmail.com

 On 26 June 2013 01:40, Steve Singer st...@ssinger.info wrote:

 On 06/25/2013 06:42 AM, Szymon Guz wrote:



 Hi,

 I've attached a new patch. I've fixed all the problems you've found,
 except for the efficiency problem, which has been described in previous
 email.

 thanks,
 Szymon


 This version of the patch addresses the issues I mentioned.  Thanks for
 looking into seeing if the performance issue is with our conversions to
 strings or inherit with the python decimal type.  I guess we (Postgresql)
 can't do much about it.   A runtime switch to use cdecimal if it is
 available is a good idea, but I agree with you that could be a different
 patch.

 One minor thing I noticed in this round,

  PLy_elog(ERROR, "could not import module 'decimal'");

 I think should have decimal in double-quotes.

 I think this patch is ready for a committer to look at it.

 Steve




 Hi Steve,
 thanks for the review.

 I was thinking about speeding up the Decimal conversion using the module
 you wrote about. What about trying to import it, if it fails, than trying
 to load decimal.Decimal? There will be no warning in logs, just additional
 information in documentation that it uses this module if it is available?

 thanks,
 Szymon



cnumeric.patch
Description: Binary data


null_ctor.patch
Description: Binary data



Re: [HACKERS] Add visibility map information to pg_freespace.

2013-06-26 Thread Simon Riggs
On 26 June 2013 09:09, Kyotaro HORIGUCHI horiguchi.kyot...@lab.ntt.co.jpwrote:


  - How about pageinspect?

   I proposed a simple representation format as a basis for
   discussion. Nevertheless, the VM pages has no more structure
   than a simple bit string. Given the VM info in pg_freespacemap,
   I've come in doubt of the necessity of vm_page_contnets() for
   the reason besides the orthogonality in the this extension's
   interface (which paid no attention before:-).


I don't think that will be needed, now I understand.


  - How about pgstattuple?

   It could even be said to be meaningful to add the number of
   not-all-visible pages or the ratio of it in the total pages..

| postgres=# select * from pgstattuple('t');
| -[ RECORD 1 ]----------------+---------
| table_len                    | 88711168
| tuple_count                  | 61
| tuple_len                    | 26400044
| tuple_percent                | 29.76
| dead_tuple_count             | 39
| dead_tuple_len               | 17599956
| dead_tuple_percent           | 19.84
| free_space                   | 33607960
| free_percent                 | 37.88
+ not_all_visible_page_percent | 23.54

 # This column name looks too long, though.


Yes, please.

But name should be all_visible_percent.
Anybody that wants not_all_visible_percent can do the math.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: FILTER for aggregates [was Re: [HACKERS] Department of Redundancy Department: makeNode(FuncCall) division]

2013-06-26 Thread Dean Rasheed
On 26 June 2013 01:01, Josh Berkus j...@agliodbs.com wrote:

 I know it's heresy in these parts, but maybe we should consider
 adopting a non-spec syntax for this feature?  In particular, it's
 really un-obvious why the FILTER clause shouldn't be inside rather
 than outside the aggregate's parens, like ORDER BY.

 Well, what other DBMSes support this feature?  Will being non-spec
 introduce migration pain?


I can't find any, but that doesn't mean they don't exist, or aren't
being worked on.

To recap, the options currently on offer are:

1). Make FILTER a new partially reserved keyword, accepting that that
might break some users' application code.

2). Make FILTER unreserved, accepting that that will lead to syntax
errors rather than more specific error messages if the user tries to
use an aggregate/window function with FILTER or OVER in the FROM
clause of a query, or as an index expression.

3). Adopt a non-standard syntax for this feature, accepting that that
might conflict with other databases, and that we can never then claim
to have implemented T612, Advanced OLAP operations.

4). Some other parser hack that will offer a better compromise?


My preference is for (2) as the lesser of several evils --- it's a
fairly narrow case where the quality of the error message is reduced.

Regards,
Dean




Re: [HACKERS] [PATCH] Fix conversion for Decimal arguments in plpython functions

2013-06-26 Thread Szymon Guz
Thanks Steve, that's exactly what I wanted to send when you sent your
patches :)

I need to figure out why that v2 patch worked for me; I think I made a mess
somewhere in my git repo and didn't create the patch properly. Sorry for
that.

Patch is attached, I've also added information about cdecimal to plpython
documentation.

I'm just wondering how to make integration tests to check when cdecimal is
installed and when it is not.

thanks,
Szymon


On 26 June 2013 10:12, Ronan Dunklau rdunk...@gmail.com wrote:

 The v2 patch does not work for me: regression tests for plpython fail on
 the plpython_types test: every numeric is converted to None.

 It seems the global decimal ctor is not initialized.

 Please find two patches, to be applied on top of the v2 patch: one
 initializes the decimal ctor, the other uses cdecimal when possible.

 Using the performance test from steve, on my machine:

 - with cdecimal installed: ~84ms
 - without cdecimal installed (standard decimal module): ~511ms


 2013/6/26 Szymon Guz mabew...@gmail.com

 On 26 June 2013 01:40, Steve Singer st...@ssinger.info wrote:

 On 06/25/2013 06:42 AM, Szymon Guz wrote:



 Hi,

 I've attached a new patch. I've fixed all the problems you've found,
 except for the efficiency problem, which has been described in previous
 email.

 thanks,
 Szymon


 This version of the patch addresses the issues I mentioned.  Thanks for
 looking into seeing if the performance issue is with our conversions to
 strings or inherit with the python decimal type.  I guess we (Postgresql)
 can't do much about it.   A runtime switch to use cdecimal if it is
 available is a good idea, but I agree with you that could be a different
 patch.

 One minor thing I noticed in this round,

  PLy_elog(ERROR, "could not import module 'decimal'");

 I think should have decimal in double-quotes.

 I think this patch is ready for a committer to look at it.

 Steve




 Hi Steve,
 thanks for the review.

 I was thinking about speeding up the Decimal conversion using the module
 you wrote about. What about trying to import it, if it fails, than trying
 to load decimal.Decimal? There will be no warning in logs, just additional
 information in documentation that it uses this module if it is available?

 thanks,
 Szymon





plpython_decimal_v3.patch
Description: Binary data



Re: [HACKERS] Hash partitioning.

2013-06-26 Thread Yuri Levinsky
 Tom,
I clearly understand your point. I actually came from corporate market
such as Verizon, Barclays... I remember very good that PostgreSQL is
open source, but let's forget it for a moment. The key issue for
corporate market always been a partitioning(vertical and lately
horizontal). Because of that Oracle has too many types and combinations
of partitions, the other vendors as well. Easy partitions maintenance
(automatic, simple syntax) is very important for everybody who lives in
corporate RDBMS world and not only use DB's for free in order to
create some virtual shop. The main purpose of partitioning in my world
is to store billions of rows and be able to search by date, hour or even
minute as fast as possible. When you dealing with company, which has
~350.000.000 users, and you don't want to use key/value data stores: you
need hash partitioned tables and hash partitioned table clusters to
perform fast search and 4-6 tables join based on user phone number for
example.  I believe to increase PostgreSQL popularity in corporate
world, to make real money from support, the next features might be:
better vertical and later horizontal partitioning,  columnar-oriented
tables, DB freeze for NetApp/EMC snapshots and similar.   

Sincerely yours,


Yuri Levinsky, DBA
Celltick Technologies Ltd., 32 Maskit St., Herzliya 46733, Israel
Mobile: +972 54 6107703, Office: +972 9 9710239; Fax: +972 9 9710222


-Original Message-
From: Tom Lane [mailto:t...@sss.pgh.pa.us] 
Sent: Tuesday, June 25, 2013 10:33 PM
To: Christopher Browne
Cc: Yuri Levinsky; Robert Haas; Bruce Momjian; PostgreSQL Mailing Lists
Subject: Re: [HACKERS] Hash partitioning.

Christopher Browne cbbro...@gmail.com writes:
 There would indeed be merit in improving the partitioning apparatus, 
 and actually, I think it's been a couple of years since there has been

 serious discussion of this.

We could certainly use a partitioning mechanism that's easier to use
than what we have now, which is basically build it yourself, here's the
parts bin.  There would also be some performance benefits from moving
the partitioning logic into hard-wired code.

However, I find it hard to think that hash partitioning as such is very
high on the to-do list.  As was pointed out upthread, the main practical
advantage of partitioning is *not* performance of routine queries, but
improved bulk-data management such as the ability to do periodic
housecleaning by dropping a partition.  If your partitioning is on a
hash, you've thrown away any such advantage, because there's no
real-world meaning to the way the data's been split up.  So I find range
and list partitioning way more plausible.

regards, tom lane




Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-06-26 Thread KONDO Mitsumasa

Thank you for comments!

 On Tue, Jun 25, 2013 at 1:15 PM, Heikki Linnakangas

Hmm, so the write patch doesn't do much, but the fsync patch makes the response
times somewhat smoother. I'd suggest that we drop the write patch for now, and

 focus on the fsyncs.
The write patch is effective for TPS! I think the delay in checkpoint writes
is caused by long fsyncs and the heavy load in the fsync phase, because
writes go to a slow disk right in the write phase. Therefore, the
combination of the write patch and the fsync patch suits better than the
write patch alone. I think the amount of WAL written at the beginning of a
checkpoint can indicate the effect of the write patch.


 What checkpointer_fsync_delay_ratio and checkpointer_fsync_delay_threshold 
 settings did you use with the fsync patch? It's disabled by default.

I used these parameters:
  checkpointer_fsync_delay_ratio = 1
  checkpointer_fsync_delay_threshold = 1000ms
As a matter of fact, I used a long sleep for slow fsyncs.

The other relevant parameters are:
  checkpoint_completion_target = 0.7
  checkpoint_smooth_target = 0.3
  checkpoint_smooth_margin = 0.5
  checkpointer_write_delay = 200ms


Attached is a quick patch to implement a fixed, 100ms delay between fsyncs,
and the assumption that the fsync phase is 10% of the total checkpoint
duration. I suspect 100ms is too small to have much effect, but that happens
to be what we currently have in CheckpointWriteDelay(). Could you test this
patch along with yours? If you can, test with different delays (e.g. 100ms,
500ms and 1000ms) and different ratios between the write and fsync phase
(e.g. 0.5, 0.7, 0.9), to get an idea of how sensitive the test case is to
those settings.
It seems an interesting algorithm! I will test it with the same settings and
study the essence of your patch.



(2013/06/26 5:28), Heikki Linnakangas wrote:

On 25.06.2013 23:03, Robert Haas wrote:

On Tue, Jun 25, 2013 at 1:15 PM, Heikki Linnakangas
hlinnakan...@vmware.com  wrote:

I'm not sure it's a good idea to sleep proportionally to the time it took to
complete the previous fsync. If you have a 1GB cache in the RAID controller,
fsyncing the a 1GB segment will fill it up. But since it fits in cache, it
will return immediately. So we proceed fsyncing other files, until the cache
is full and the fsync blocks. But once we fill up the cache, it's likely
that we're hurting concurrent queries. ISTM it would be better to stay under
that threshold, keeping the I/O system busy, but never fill up the cache
completely.


Isn't the behavior implemented by the patch a reasonable approximation
of just that?  When the fsyncs start to get slow, that's when we start
to sleep.   I'll grant that it would be better to sleep when the
fsyncs are *about* to get slow, rather than when they actually have
become slow, but we have no way to know that.


Well, that's the point I was trying to make: you should sleep *before* the 
fsyncs
get slow.
Actually, fsync time changes with the progress of background disk writes in
the OS, and we cannot know the progress of background disk writes before the
fsyncs. I think Robert's argument is right. Please see the following log
messages.


* fsync of files which had already been written to disk
 DEBUG:  0: checkpoint sync: number=23 file=base/16384/16413.5 time=2.546 msec
 DEBUG:  0: checkpoint sync: number=24 file=base/16384/16413.6 time=3.174 msec
 DEBUG:  0: checkpoint sync: number=25 file=base/16384/16413.7 time=2.358 msec
 DEBUG:  0: checkpoint sync: number=26 file=base/16384/16413.8 time=2.013 msec
 DEBUG:  0: checkpoint sync: number=27 file=base/16384/16413.9 time=1232.535 msec
 DEBUG:  0: checkpoint sync: number=28 file=base/16384/16413_fsm time=0.005 msec

* fsync of files which had not been written to disk very much
 DEBUG:  0: checkpoint sync: number=54 file=base/16384/16419.8 time=3408.759 msec
 DEBUG:  0: checkpoint sync: number=55 file=base/16384/16419.9 time=3857.075 msec
 DEBUG:  0: checkpoint sync: number=56 file=base/16384/16419.10 time=13848.237 msec
 DEBUG:  0: checkpoint sync: number=57 file=base/16384/16419.11 time=898.836 msec
 DEBUG:  0: checkpoint sync: number=58 file=base/16384/16419_fsm time=0.004 msec
 DEBUG:  0: checkpoint sync: number=59 file=base/16384/16419_vm time=0.002 msec

I think it is wasteful to sleep after every fsync, including the short ones,
and fsync performance also varies with the hardware (such as the RAID card
and the kind and number of disks) and the OS, so it is difficult to set a
fixed sleep time. My proposed method will be more adaptive in these cases.
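To illustrate what I mean by adaptive, here is a rough standalone sketch
(not the actual checkpointer code; the two settings mirror the GUC names
from my test configuration above, and a real implementation would use
pg_usleep, cap the delay, and handle errors properly):

#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>

/* Settings corresponding to my test configuration above (illustrative). */
static double checkpointer_fsync_delay_ratio = 1.0;         /* sleep = ratio * fsync time */
static long   checkpointer_fsync_delay_threshold_ms = 1000; /* only after slow fsyncs */

static long
elapsed_ms(struct timeval start, struct timeval end)
{
    return (end.tv_sec - start.tv_sec) * 1000L +
        (end.tv_usec - start.tv_usec) / 1000L;
}

/*
 * Fsync one file, then sleep in proportion to how long the fsync took,
 * but only if it was slow enough to suggest the device is saturated.
 * Fast fsyncs (data already on disk, or absorbed by the controller cache)
 * cause no delay at all.
 */
static void
checkpoint_sync_file(int fd)
{
    struct timeval start,
                end;
    long        ms;

    gettimeofday(&start, NULL);
    fsync(fd);
    gettimeofday(&end, NULL);

    ms = elapsed_ms(start, end);
    if (ms >= checkpointer_fsync_delay_threshold_ms)
        usleep((useconds_t) (ms * checkpointer_fsync_delay_ratio * 1000.0));
}

int
main(void)
{
    /* Tiny demo: write and sync a scratch file once. */
    FILE       *f = fopen("scratch.tmp", "w");

    if (f == NULL)
        return 1;
    fputs("demo\n", f);
    fflush(f);
    checkpoint_sync_file(fileno(f));
    fclose(f);
    return 0;
}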



The only feedback we have on how bad things are is how long it took
the last fsync to complete, so I actually think that's a much better
way to go than any fixed sleep - which will often be unnecessarily
long on a well-behaved system, and which will often be far too short
on one that's having trouble. I'm inclined to think think Kondo-san
has got it right.


Quite possible, I really don't know. I'm inclined to first try the 

Re: [HACKERS] [PATCH] Fix conversion for Decimal arguments in plpython functions

2013-06-26 Thread Szymon Guz
You had a great idea; the time with cdecimal is really great. The
difference on my machine is 64 ms vs 430 ms.

Szymon


Re: FILTER for aggregates [was Re: [HACKERS] Department of Redundancy Department: makeNode(FuncCall) division]

2013-06-26 Thread Pavel Stehule
2013/6/26 Dean Rasheed dean.a.rash...@gmail.com:
 On 26 June 2013 01:01, Josh Berkus j...@agliodbs.com wrote:

 I know it's heresy in these parts, but maybe we should consider
 adopting a non-spec syntax for this feature?  In particular, it's
 really un-obvious why the FILTER clause shouldn't be inside rather
 than outside the aggregate's parens, like ORDER BY.

 Well, what other DBMSes support this feature?  Will being non-spec
 introduce migration pain?


 I can't find any, but that doesn't mean they don't exist, or aren't
 being worked on.

 To recap, the options currently on offer are:

 1). Make FILTER a new partially reserved keyword, accepting that that
 might break some users' application code.

 2). Make FILTER unreserved, accepting that that will lead to syntax
 errors rather than more specific error messages if the user tries to
 use an aggregate/window function with FILTER or OVER in the FROM
 clause of a query, or as an index expression.

 3). Adopt a non-standard syntax for this feature, accepting that that
 might conflict with other databases, and that we can never then claim
 to have implemented T612, Advanced OLAP operations.

 4). Some other parser hack that will offer a better compromise?


 My preference is for (2) as the lesser of several evils --- it's a
 fairly narrow case where the quality of the error message is reduced.


@2 looks well

Pavel

 Regards,
 Dean






Re: [HACKERS] Bloom Filter lookup for hash joins

2013-06-26 Thread Pavel Stehule
Hello

it looks interesting - please, try to send some performance results,

Regards

Pavel

2013/6/26 Atri Sharma atri.j...@gmail.com:
 Hi All,

 I have been researching bloom filters and discussed it on IRC with
 RhodiumToad and David Fetter, and they pointed me to the various
 places that could potentially have bloom filters, apart from the
 places that already have them currently.

 I have been reading the current implementation of hash joins, and in
 ExecScanHashBucket, which I understand is the actual lookup function,
 we could potentially look at a bloom filter per bucket. Instead of
 actually looking up each hash value for the outer relation, we could
 just check the corresponding bloom filter for that bucket, and if we
 get a positive, then lookup the actual values i.e. continue with our
 current behaviour (since we could be looking at a false positive).

 This doesn't involve a lot of new logic. We need to add a bloom filter
 in HashJoinState and set it when the hash table is being made. Also,
 one thing we need to look at is the set of hash functions being used
 for the bloom filters. This can be a point of further discussion.

 The major potential benefit we could gain is that bloom filters never
 give false negatives. So, we could save a lot of lookup where the
 corresponding bloom filter gives a negative.

 This approach can also be particularly useful for hash anti joins,
 where we look for negatives. Since bloom filters can easily be stored
 and processed, by straight bit operations, we could be looking at a
 lot of saving here.

 I am not sure if we already do something like this. Please direct me
 to the relevant parts in the code if we already do.

 Regards,

 Atri

 --
 Regards,

 Atri
 l'apprenant






Re: [HACKERS] Kudos for Reviewers -- straw poll

2013-06-26 Thread Albe Laurenz
Dean Rasheed wrote:
 How should reviewers get credited in the release notes?

 a) not at all
 b) in a single block titled Reviewers for this version at the bottom.
 c) on the patch they reviewed, for each patch

 
 b) Unless they contribute enough to the patch to be considered a co-author.
 
 
 Should there be a criteria for a creditable review?

 a) no, all reviews are worthwhile
 b) yes, they have to do more than "it compiles"
 c) yes, only code reviews should count

 
 a) Sometimes even "it compiles" can be worthwhile, if there is doubt
 over platform compatibility. All contributions should be acknowledged
 and encouraged.
 
 
 Should reviewers for 9.4 get a prize, such as a t-shirt, as a
 promotion to increase the number of non-submitter reviewers?

 a) yes
 b) no
 c) yes, but submitters and committers should get it too

 
 b) Getting your name in the fine manual is reward enough ;-)

+1, except that I like Josh's idea about a free ticket to pgCon.

Yours,
Laurenz Albe



Re: [HACKERS] Bloom Filter lookup for hash joins

2013-06-26 Thread Atri Sharma
On Wed, Jun 26, 2013 at 2:09 PM, Pavel Stehule pavel.steh...@gmail.com wrote:
 Hello

 it looks interesting - please, try to send some performance results,

 Regards

 Pavel

Hi,

Sure. I will think more about it and put up a design on the list soon.

My current aim would be to work on hash joins. If it works well for
us, we can look for hash anti joins as well.

Regards,

Atri


--
Regards,

Atri
l'apprenant




Re: [HACKERS] Kudos for Reviewers -- straw poll

2013-06-26 Thread Atri Sharma
For me, B,B and another B works.

Regards,

Atri


--
Regards,

Atri
l'apprenant




Re: [HACKERS] updated emacs configuration

2013-06-26 Thread Dimitri Fontaine
Peter Eisentraut pete...@gmx.net writes:
 emacs -q src/backend/commands/extension.c
 emacs -q -l ../emacs.samples src/backend/commands/extension.c

 Well, the first one uses 8-space tabs, the second 4-space tabs, so they
 look completely different.  I'm not sure what you are trying to point
 out.

With the .dir-locals.el in place, the first one should be using 4-space
tabs too, right?

   - Name the lambda used in the hook for easier removing / reference

 Interesting, I had never thought of that.  I don't see that used in
 Emacs source code or core packages, however.  Do you have a reference
 that this is recommended practice?

Well, a friend of mine pointed that out to me one day, and he's an Emacs
and Gnus committer. The thing is that you're not guaranteed that the same
lambda source code will evaluate to the same object each time, and that
would prevent you from removing the hook.

 $ git clone git://git.postgresql.org/git/postgresql.git
 Cloning into 'postgresql'...

I can reproduce that here. I don't know why I have those postgres dirs
then, and I'm pretty confused about my round of testing now.

-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support




Re: [HACKERS] FILTER for aggregates [was Re: Department of Redundancy Department: makeNode(FuncCall) division]

2013-06-26 Thread Andrew Gierth
Dean Rasheed said:
 To recap, the options currently on offer are:

 1). Make FILTER a new partially reserved keyword, accepting that that
 might break some users' application code.

 2). Make FILTER unreserved, accepting that that will lead to syntax
 errors rather than more specific error messages if the user tries to
 use an aggregate/window function with FILTER or OVER in the FROM
 clause of a query, or as an index expression.

 3). Adopt a non-standard syntax for this feature, accepting that that
 might conflict with other databases, and that we can never then claim
 to have implemented T612, Advanced OLAP operations.

 4). Some other parser hack that will offer a better compromise?

 My preference is for (2) as the lesser of several evils --- it's a
 fairly narrow case where the quality of the error message is reduced.

Possibly significant in this context is that there is a proof-of-concept
patch in development for another part of T612, namely inverse
distribution functions (e.g. percentile_disc and percentile_cont) which
should be available by the next CF, and which will require a similar
decision with respect to the keyword WITHIN (to support doing:
  select percentile_cont(0.5) within group (order by x) from ...;
which returns the median value of x).

This syntax is much more widely supported than FILTER, including by both
Oracle and MSSQL, so a non-standard alternative is much less attractive.
A solution which suits both (i.e. either option 1 or option 2) would make
a whole lot more sense than trying to handle them differently.

-- 
Andrew (irc:RhodiumToad)




Re: [HACKERS] updated emacs configuration

2013-06-26 Thread Jan Urbański

On 26/06/13 10:51, Dimitri Fontaine wrote:

Peter Eisentraut pete...@gmx.net writes:

$ git clone git://git.postgresql.org/git/postgresql.git
Cloning into 'postgresql'...


I can reproduce that here. I don't know why I have those postgres dirs
then, and I'm pretty confused about my round of testing now.


Maybe you cloned from GitHub, where the mirrored repository is called 
'postgres'?


Cheers,
Jan





Re: [HACKERS] [Review] Add SPI_gettypmod() to return a field's typemod from a TupleDesc

2013-06-26 Thread Jeevan Chalke
On Wed, Jun 26, 2013 at 7:49 AM, Mark Wong mark...@gmail.com wrote:

 On Tue, Jun 25, 2013 at 1:38 AM, Jeevan Chalke
 jeevan.cha...@enterprisedb.com wrote:
  Hi Mark,
 
  Is this the latest patch you are targeting for 9.4 CF1 ?
 
  I am going to review it.
 
  From the comment, here is one issue you need to resolve first:
 
  *** exec_eval_datum(PLpgSQL_execstate *estat
  *** 4386,4396 ****
               errmsg("record \"%s\" has no field \"%s\"",
                      rec->refname, recfield->fieldname)));
        *typeid = SPI_gettypeid(rec->tupdesc, fno);
  !     /* XXX there's no SPI_gettypmod, for some reason */
  !     if (fno > 0)
  !         *typetypmod = rec->tupdesc->attrs[fno - 1]->atttypmod;
  !     else
  !         *typetypmod = -1;
        *value = SPI_getbinval(rec->tup, rec->tupdesc, fno, &isnull);
        break;
    }
  --- 4386,4392 ----
               errmsg("record \"%s\" has no field \"%s\"",
                      rec->refname, recfield->fieldname)));
        *typeid = SPI_gettypeid(rec->tupdesc, fno);
  !     *typetypmod = SPI_gettypeid(rec->tupdesc, fno);
        *value = SPI_getbinval(rec->tup, rec->tupdesc, fno, &isnull);
        break;
    }
 
 
  Once you confirm, I will go ahead reviewing it.

 Hi Jeevan,

 Oopsies, I've updated the patch and attached it.


Here are my review points:

1. Patch is very simple and straight forward.
2. Applies well with patch command. No issues at all.
3. Regression test passes. We have good coverage for that. Also NO issues
found with my testing.
4. New function is analogous to other SPI_get* functions
5. Ready for committer

However, while walking through your changes, I see the following line:
/* XXX there's no SPI_getcollation either */
It suggests we need a function for SPI_getcollation as well. That would be
another simple patch.
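For readers following along, here is a rough sketch of what an
SPI_gettypmod()-style accessor might look like, modeled on SPI_gettypeid()
and on the attribute access in the hunk quoted above. The function name is
hypothetical and the actual patch may differ in naming and error handling:

#include "postgres.h"
#include "access/sysattr.h"
#include "access/tupdesc.h"
#include "executor/spi.h"

/* Hypothetical sketch only; not the submitted patch. */
int32
SPI_gettypmod_sketch(TupleDesc tupdesc, int fnumber)
{
    SPI_result = 0;

    if (fnumber > tupdesc->natts || fnumber == 0 ||
        fnumber <= FirstLowInvalidHeapAttributeNumber)
    {
        SPI_result = SPI_ERROR_NOATTRIBUTE;
        return -1;
    }

    if (fnumber > 0)
        return tupdesc->attrs[fnumber - 1]->atttypmod;

    return -1;                  /* system attributes carry no typmod */
}

An SPI_getcollation() would presumably follow exactly the same pattern.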

Anyway, that is not part of this topic, so I will go ahead and mark this
patch as Ready for committer.

Thanks

Regards,
 Mark




-- 
Jeevan B Chalke
Senior Software Engineer, R&D
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Phone: +91 20 30589500

Website: www.enterprisedb.com
EnterpriseDB Blog: http://blogs.enterprisedb.com/
Follow us on Twitter: http://www.twitter.com/enterprisedb



Re: [HACKERS] [PATCH] Fix conversion for Decimal arguments in plpython functions

2013-06-26 Thread Ronan Dunklau
It seems like you confused me with Steve :)

The patch applies cleanly, and the regression tests pass on python2 when
cdecimal is not installed. When it is, the type info returned for the
converted numeric value is cdecimal.Decimal instead of decimal.Decimal.

The regression tests' expected output has not been modified for python3,
and as such those tests fail on the type conversions.

I am a bit confused by the use of PyModule_GetDict: shouldn't
PyObject_GetAttrString be used directly instead? Moreover, the reference
counting in the current implementation might be off: the reference count for
the decimal module is never decreased, while the reference count of the
module dict is, even though the docs say it returns a borrowed reference.

Please find a patch that fixes both issues.
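To spell out the reference-counting point, here is a short sketch of the two
approaches (plain CPython C API, illustrative only, not taken from either
patch):

#include <Python.h>

static PyObject *
get_decimal_ctor_via_dict(PyObject *module)
{
    /* PyModule_GetDict returns a BORROWED reference: do not Py_DECREF it. */
    PyObject   *dict = PyModule_GetDict(module);

    /* PyDict_GetItemString also returns a borrowed reference... */
    PyObject   *ctor = PyDict_GetItemString(dict, "Decimal");

    /* ...so take our own reference before the caller stores it. */
    Py_XINCREF(ctor);
    return ctor;
}

static PyObject *
get_decimal_ctor_direct(PyObject *module)
{
    /* PyObject_GetAttrString returns a NEW reference, ready to store. */
    return PyObject_GetAttrString(module, "Decimal");
}

Either way, the module object returned by PyImport_ImportModule is a new
reference and should be released with Py_DECREF once the constructor has
been fetched.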




2013/6/26 Szymon Guz mabew...@gmail.com

 Thanks Steve, that's exactly what I wanted to send when you sent your
 patches :)

 I need to figure out why that patch v2 worked for me, I think I made mess
 somewhere in my git repo and didn't create the patch properly. Sorry for
 that.

 Patch is attached, I've also added information about cdecimal to plpython
 documentation.

 I'm just wondering how to make integration tests to check when cdecimal is
 installed and when it is not.

 thanks,
 Szymon


 On 26 June 2013 10:12, Ronan Dunklau rdunk...@gmail.com wrote:

 The v2 patch does not work for me: regression tests for plpython fail on
 the plpython_types test: every numeric is converted to None.

 It seems the global decimal ctor is not initialized.

 Please find two patches, to be applied on top of the v2 patch: one
 initializes the decimal ctor, the other uses cdecimal when possible.

 Using the performance test from steve, on my machine:

 - with cdecimal installed: ~84ms
 - without cdecimal installed (standard decimal module): ~511ms


 2013/6/26 Szymon Guz mabew...@gmail.com

 On 26 June 2013 01:40, Steve Singer st...@ssinger.info wrote:

 On 06/25/2013 06:42 AM, Szymon Guz wrote:



 Hi,

 I've attached a new patch. I've fixed all the problems you've found,
 except for the efficiency problem, which has been described in previous
 email.

 thanks,
 Szymon


 This version of the patch addresses the issues I mentioned.  Thanks for
 looking into seeing if the performance issue is with our conversions to
 strings or inherit with the python decimal type.  I guess we (Postgresql)
 can't do much about it.   A runtime switch to use cdecimal if it is
 available is a good idea, but I agree with you that could be a different
 patch.

 One minor thing I noticed in this round,

  PLy_elog(ERROR, "could not import module 'decimal'");

 I think should have decimal in double-quotes.

 I think this patch is ready for a committer to look at it.

 Steve




 Hi Steve,
 thanks for the review.

 I was thinking about speeding up the Decimal conversion using the module
 you wrote about. What about trying to import it, if it fails, than trying
 to load decimal.Decimal? There will be no warning in logs, just additional
 information in documentation that it uses this module if it is available?

 thanks,
 Szymon






plpython_decimal_fix_regression.patch
Description: Binary data



Re: [HACKERS] Review: Patch to compute Max LSN of Data Pages

2013-06-26 Thread Amit Kapila
On Wednesday, June 26, 2013 1:20 PM Andres Freund wrote:
 On 2013-06-26 08:50:27 +0530, Amit Kapila wrote:
  On Tuesday, June 25, 2013 11:12 PM Andres Freund wrote:
   On 2013-06-16 17:19:49 -0700, Josh Berkus wrote:
Amit posted a new version of this patch on January 23rd.  But
 last
comment on it by Tom is not sure everyone wants this.
   
https://commitfest.postgresql.org/action/patch_view?id=905
  
... so, what's the status of this patch?
  
   That comment was referencing a mail of mine - so perhaps I better
   explain:
  
   I think the usecase for this utility isn't big enough to be
 included in
   postgres since it really can only help in a very limited
   circumstances. And I think it's too likely to be misused for stuff
 it's
   not useable for (e.g. remastering).
  
   The only scenario I see is that somebody deleted/corrupted
   pg_controldata. In that scenario the tool is supposed to be used to
   find
   the biggest lsn used so far so the user then can use pg_resetxlog
 to
   set
   that as the wal starting point.
   But that can be way much easier solved by just setting the LSN to
   something very, very high. The database cannot be used for anything
   reliable afterwards anyway.
 
  One of the main reason this was written was to make server up in case
 of
  corruption and
  user can take dump of some useful information if any.
 
  By setting LSN very, very high user might loose the information which
 he
  wants to take dump.
 
 Which information would that loose? 
  Information from WAL replay, which can be more appropriate when selecting
  an LSN.
  Also, for a developer, guessing a very high LSN might be easy, but for
  users it might not be equally easy, and getting such a value from a
  utility would be convenient.

  One more use case for which this utility was written is as below:
  it will be used to decide whether a full backup of the new standby
  (old master) is needed from the new master (old standby).
  The backup is required when a data page in the old master precedes
  the last applied LSN in the old standby (i.e., the new master) at the
  moment of the failover.


 We don't currently use the LSN for
 tuple visibility. And you sure wouldn't do anything but dump such a
 cluster.
 Now you could argue that you could modify this to find the current xid
 used - but that's not that easy due to the wraparound semantics of
 xids. And you are more likely to be successfull by looking at pg_clog.

With Regards,
Amit Kapila.





Re: [HACKERS] [PATCH] Fix conversion for Decimal arguments in plpython functions

2013-06-26 Thread Szymon Guz
On 26 June 2013 12:04, Ronan Dunklau rdunk...@gmail.com wrote:

 It seems like you confused me with steve :)


Hi Ronan,
Oh, yes. I'm sorry for that :)


 The patch applies cleanly, and the regression tests pass on python2 when
 cdecimal is not installed. When it is, the type info returned for the
 converted numeric value is cdecimal.Decimal instead of decimal.Decimal.

 The regression tests expected output have not been modified for python3,
 and as such they fail on the type conversions.

 I am a bit confused with the use of PyModule_GetDict: shouldn't
 PyObj_GetAttrString be used directly instead ? Moreover, the reference
 count in the current implementation might be off: the reference count for
 the decimal module is never decreased, while the reference count to the
 module dict is, when the docs say it returns a borrowed reference.

 Please find a patch that fixes both issues.


Thanks for the patch. I assume you generated it from a clean trunk and that
it includes all the changes (mine and yours), right?

I've checked the patch, everything looks great.
I've attached it to this email with changed name, just for consistent
naming in commitfest app.

thanks,
Szymon


plpython_decimal_v4.patch
Description: Binary data



Re: [HACKERS] Bloom Filter lookup for hash joins

2013-06-26 Thread Ants Aasma
On Wed, Jun 26, 2013 at 9:46 AM, Atri Sharma atri.j...@gmail.com wrote:
 I have been researching bloom filters and discussed it on IRC with
 RhodiumToad and David Fetter, and they pointed me to the various
 places that could potentially have bloom filters, apart from the
 places that already have them currently.

One idea that I had was to use them to do CLOG lookups from smaller
datastructures. You could store the list of aborted xids in bloom
filters. When a xid is not found in the filter, then it is known to
have committed, if the filter returns true, then you have to check the
real CLOG. This would achieve for example a 1:8 compression ratio at
99% commit rate and 1:160'000 false positive rate, or 1:16 compression
ratio at 1:400 false positive rate. Nothing too exciting, so I didn't
really develop the idea any further.
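As a quick sanity check on those figures (my own arithmetic, assuming an
optimally sized filter, where m/n is the number of filter bits per stored
element and k = (m/n) ln 2 hash functions are used):

    p \approx \left(1 - e^{-kn/m}\right)^{k} \approx 0.6185^{\,m/n}

CLOG spends 2 bits per xid, and at a 99% commit rate only 1% of xids go into
the filter. A 1:8 compression ratio therefore gives m/n = (2/8)/0.01 = 25
bits per aborted xid and 0.6185^25 is roughly 6e-6, about 1 in 160,000;
1:16 compression gives m/n = 12.5 and 0.6185^12.5 is roughly 2.5e-3, about
1 in 400, which matches the figures above.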

 I have been reading the current implementation of hash joins, and in
 ExecScanHashBucket, which I understand is the actual lookup function,
 we could potentially look at a bloom filter per bucket. Instead of
 actually looking up each hash value for the outer relation, we could
 just check the corresponding bloom filter for that bucket, and if we
 get a positive, then lookup the actual values i.e. continue with our
 current behaviour (since we could be looking at a false positive).

The problem here is that if the hash table is in memory, doing a hash
table lookup directly is likely to be cheaper than a bloom filter
lookup, even if the bloom filter fits into the processor cache and the
hash table doesn't (ten last-level cache hits are slower than one cache
miss). A Bloom filter will help when it's not feasible to use an actual
hash table (lots of large keys), the real lookup is slow (e.g. an
index lookup), you are doing a lot of lookups to amortize the
construction cost, and the condition is expected to be selective (most
lookups give a negative). There might be some data-warehouse types of
queries where it would help, but it seems like an awfully narrow use
case with a potential for performance regressions when the planner has
a bad estimate.
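
To put rough numbers on that parenthetical (assumed typical latencies for a
recent x86 server, not measurements):

    ~10 filter probes x ~40 cycles per last-level cache hit  =  ~400 cycles
    1 main-memory access                                      =  ~200-300 cycles

so even a fully cache-resident filter needing ~10 probes costs about as much
as, or more than, the single cache miss it would save.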

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Review: Patch to compute Max LSN of Data Pages

2013-06-26 Thread Andres Freund
Hi Amit,

On 2013-06-26 16:22:28 +0530, Amit Kapila wrote:
 On Wednesday, June 26, 2013 1:20 PM Andres Freund wrote:
  On 2013-06-26 08:50:27 +0530, Amit Kapila wrote:
   On Tuesday, June 25, 2013 11:12 PM Andres Freund wrote:
On 2013-06-16 17:19:49 -0700, Josh Berkus wrote:
 Amit posted a new version of this patch on January 23rd.  But
  last
 comment on it by Tom is not sure everyone wants this.

 https://commitfest.postgresql.org/action/patch_view?id=905
   
 ... so, what's the status of this patch?
   
That comment was referencing a mail of mine - so perhaps I better
explain:
   
I think the usecase for this utility isn't big enough to be
  included in
postgres since it really can only help in a very limited
circumstances. And I think it's too likely to be misused for stuff
  it's
not useable for (e.g. remastering).
   
The only scenario I see is that somebody deleted/corrupted
pg_controldata. In that scenario the tool is supposed to be used to
find
the biggest lsn used so far so the user then can use pg_resetxlog
  to
set
that as the wal starting point.
But that can be way much easier solved by just setting the LSN to
something very, very high. The database cannot be used for anything
reliable afterwards anyway.
  
   One of the main reason this was written was to make server up in case
  of
   corruption and
   user can take dump of some useful information if any.
  
   By setting LSN very, very high user might loose the information which
  he
   wants to take dump.
  
  Which information would that loose? 
   Information from WAL replay which can be more appropriate by selecting
 LSN.

Sorry, I can't follow. If wal replay still is an option you can just
look at the WAL and get a sensible value way easier. The whole tool
seems to only make sense if you've lost pg_xlog.

   Also for a developer, guessing very high LSN might be easy, but for users
   it might not be equally easy, and getting such value by utility would be
 comfortable.

Well, then we can just document some very high lsn and be done with
it. Like CF00/.
That would leave enough space for eventual writes caused while dumping
the database (say hint bit writes in a checksummed database) and cannot
yet be realistically be reached during normal operation.

   One more use case for which this utility was done is as below:
   It will be used to decide that on new-standby (old-master) whether a full
 backup is needed from
   New-master(old-standby).
   The backup is required when the data page in old-master precedes
   the last applied LSN in old-standby (i.e., new-master) at the moment
   of the failover.

That's exactly what I was afraid of. Unless I miss something the tool is
*NOT* sufficient to do this. Look at the mail introducing pg_rewind and
the ensuing discussion for what's necessary for that.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] checking variadic any argument in parser - should be array

2013-06-26 Thread Jeevan Chalke
Hi Pavel


On Sat, Jan 26, 2013 at 9:22 AM, Pavel Stehule pavel.steh...@gmail.com wrote:

 Hello Tom

 you did comment

 !  * Non-null argument had better be an array.  The parser doesn't
 !  * enforce this for VARIADIC ANY functions (maybe it should?), so
 !  * that check uses ereport not just elog.
 !  */

 So I moved this check to parser.

 It is little bit stricter - requests typed NULL instead unknown NULL,
 but it can mark error better and early


Tom suggested that this check would better be done by the parser.
This patch tries to accomplish that.

I will go review this.

However, is it possible for you to re-base it on current master? I can't
apply it using git apply, but patch -p1 succeeded with a lot of offset.

Thanks



 Regards

 Pavel


 --
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers




-- 
Jeevan B Chalke
Senior Software Engineer, R&D
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Phone: +91 20 30589500

Website: www.enterprisedb.com
EnterpriseDB Blog: http://blogs.enterprisedb.com/
Follow us on Twitter: http://www.twitter.com/enterprisedb

This e-mail message (and any attachment) is intended for the use of the
individual or entity to whom it is addressed. This message contains
information from EnterpriseDB Corporation that may be privileged,
confidential, or exempt from disclosure under applicable law. If you are
not the intended recipient or authorized to receive this for the intended
recipient, any use, dissemination, distribution, retention, archiving, or
copying of this communication is strictly prohibited. If you have received
this e-mail in error, please notify the sender immediately by reply e-mail
and delete this message.


Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-06-26 Thread Heikki Linnakangas

On 26.06.2013 11:37, KONDO Mitsumasa wrote:

On Tue, Jun 25, 2013 at 1:15 PM, Heikki Linnakangas

Hmm, so the write patch doesn't do much, but the fsync patch makes
the response
times somewhat smoother. I'd suggest that we drop the write patch
for now, and focus on the fsyncs.


Write patch is effective in TPS!


Your test results don't agree with that. You got 3465.96 TPS with the 
write patch, and 3474.62 and 3469.03 without it. The fsync+write 
combination got slightly more TPS than just the fsync patch, but only by 
about 1%, and then the response times were worse.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] add --progress option to pgbench (submission 3)

2013-06-26 Thread Fabien COELHO


Dear Mitsumasa,

Here is a v4 that takes into account most of your points: the report is 
performed for all threads by thread 0; however, --progress is not supported 
under thread fork emulation if there is more than one thread. The report 
time does not slip anymore.


However I've kept the format sparse. It is a style thing :-) and it is more 
consistent with the kind of format used in the log. I have not added the 
latency measure because it is redundant with the tps, and the latency 
that people expect is the actual latency of each transaction, not 
the apparent latency of transactions running in parallel, which is really 
a throughput.
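
For illustration, a run with the new option would look roughly as follows; the
database name and the numbers are invented, but the report line is the format
produced by the fprintf in the patch below.

  $ pgbench -c 8 -j 2 -T 30 -P 5 bench
  ...
  progress: 5.0 s 1234.5 tps
  progress: 10.0 s 1241.2 tps
  progress: 15.0 s 1238.7 tps
  ...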


--
Fabien.

diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 1303217..707ea37 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -74,7 +74,7 @@ static int	pthread_join(pthread_t th, void **thread_return);
 #include pthread.h
 #else
 /* Use emulation with fork. Rename pthread identifiers to avoid conflicts */
-
+#define PTHREAD_FORK_EMULATION
 #include sys/wait.h
 
 #define pthread_t	pg_pthread_t
@@ -164,6 +164,8 @@ bool		use_log;			/* log transaction latencies to a file */
 bool		use_quiet;			/* quiet logging onto stderr */
 int			agg_interval;		/* log aggregates instead of individual
  * transactions */
+int			progress = 0;   /* thread progress report every this seconds */
+int progress_nclients = 0; /* number of clients for progress report */
 bool		is_connect;			/* establish connection for each transaction */
 bool		is_latencies;		/* report per-command latencies */
 int			main_pid;			/* main process id used in log filename */
@@ -354,6 +356,8 @@ usage(void)
 		  protocol for submitting queries to server (default: simple)\n
 		 -n   do not run VACUUM before tests\n
 		 -N   do not update tables \pgbench_tellers\ and \pgbench_branches\\n
+		 -P NUM, --progress NUM\n
+		  show progress report about every NUM seconds\n
 		 -r   report average latency per command\n
 		 -s NUM   report this scale factor in output\n
 		 -S   perform SELECT-only transactions\n
@@ -2115,6 +2119,7 @@ main(int argc, char **argv)
 		{unlogged-tables, no_argument, unlogged_tables, 1},
 		{sampling-rate, required_argument, NULL, 4},
 		{aggregate-interval, required_argument, NULL, 5},
+		{progress, required_argument, NULL, 'P'},
 		{NULL, 0, NULL, 0}
 	};
 
@@ -2181,7 +2186,7 @@ main(int argc, char **argv)
 	state = (CState *) pg_malloc(sizeof(CState));
 	memset(state, 0, sizeof(CState));
 
-	while ((c = getopt_long(argc, argv, ih:nvp:dqSNc:j:Crs:t:T:U:lf:D:F:M:, long_options, optindex)) != -1)
+	while ((c = getopt_long(argc, argv, ih:nvp:dqSNc:j:Crs:t:T:U:lf:D:F:M:P:, long_options, optindex)) != -1)
 	{
 		switch (c)
 		{
@@ -2336,6 +2341,16 @@ main(int argc, char **argv)
 	exit(1);
 }
 break;
+			case 'P':
+				progress = atoi(optarg);
+				if (progress <= 0)
+				{
+					fprintf(stderr,
+					   "thread progress delay (-P) must not be negative (%s)\n",
+							optarg);
+					exit(1);
+				}
+				break;
 			case 0:
 /* This covers long options which take no argument. */
 break;
@@ -2401,6 +2416,15 @@ main(int argc, char **argv)
 		exit(1);
 	}
 
+#ifdef PTHREAD_FORK_EMULATION
+	if (progress && nthreads > 1)
+	{
+		fprintf(stderr,
+				"option --progress does not work with thread fork emulation");
+		exit(1);
+	}
+#endif /* PTHREAD_FORK_EMULATION */
+
 	/* --sampling-rate may be used only with -l */
 	if (sample_rate > 0.0 && !use_log)
 	{
@@ -2461,6 +2485,7 @@ main(int argc, char **argv)
 	 * changed after fork.
 	 */
 	main_pid = (int) getpid();
+	progress_nclients = nclients;
 
 	if (nclients > 1)
 	{
@@ -2712,6 +2737,11 @@ threadRun(void *arg)
 	int			nstate = thread->nstate;
 	int			remains = nstate;		/* number of remaining clients */
 	int			i;
+	/* for reporting progress: */
+	int64		thread_start = INSTR_TIME_GET_MICROSEC(thread->start_time);
+	int64		last_report = thread_start;
+	int64		next_report = last_report + progress * 1000000;
+	int64		last_count = 0;
 
 	AggVals		aggs;
 
@@ -2875,6 +2905,30 @@ threadRun(void *arg)
 				st->con = NULL;
 			}
 		}
+
+		/* progress report by thread 0 */
+		if (progress && thread->tid == 0)
+		{
+			instr_time	now_time;
+			int64		now;
+			INSTR_TIME_SET_CURRENT(now_time);
+			now = INSTR_TIME_GET_MICROSEC(now_time);
+			if (now >= next_report)
+			{
+				/* generate and show report */
+				int64		count = 0;
+				int64		run = now - last_report;
+				/* thread 0 reports other threads data  */
+				for (i = 0 ; i < progress_nclients ; i++)
+					count += state[i].cnt;
+				fprintf(stderr, "progress: %.1f s %.1f tps\n",
+						(now - thread_start) / 1000000.0,
+						1000000.0 * (count-last_count) / run);
+				last_count = count;
+				last_report = now;
+				next_report += progress * 1000000;
+			}
+			}
+		}
 	}
 
 done:
diff --git a/doc/src/sgml/pgbench.sgml b/doc/src/sgml/pgbench.sgml
index 

Re: [HACKERS] PQConnectPoll, connect(2), EWOULDBLOCK and somaxconn

2013-06-26 Thread Andres Freund
Hi,

On 2013-06-17 16:16:22 +0200, Andres Freund wrote:
 When postgres on linux receives connection on a high rate client
 connections sometimes error out with:
 could not send data to server: Transport endpoint is not connected
 could not send startup packet: Transport endpoint is not connected
 
 To reproduce start something like on a server with sufficiently high
 max_connections:
 pgbench -h /tmp -p 5440 -T 10 -c 400 -j 400 -n -f /tmp/simplequery.sql
 
 Now that's strange since that error should happen at connect(2) time,
 not when sending the startup packet. Some investigation led me to
 fe-secure.c's PQConnectPoll:

 So, we're accepting EWOULDBLOCK as a valid return value for
 connect(2). Which it isn't. EAGAIN in contrast is on some BSDs and on
 linux. Unfortunately POSIX allows those two to share the same value...
 
 My manpage tells me:
 EAGAIN No more free local ports or insufficient entries in the routing
        cache.  For AF_INET see the description of
        /proc/sys/net/ipv4/ip_local_port_range ip(7) for information on
        how to increase the number of local ports.
 
 So, the problem is that we took a failed connection as having been
 initially successfull but in progress.
 
 Not accepting EWOULDBLOCK in the above if() results in:
 could not connect to server: Resource temporarily unavailable
   Is the server running locally and accepting
   connections on Unix domain socket /tmp/.s.PGSQL.5440?
 
 which makes more sense.
 
 Trivial patch attached.
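
For context, a minimal sketch (not the libpq code) of how the result of a
non-blocking connect(2) is usually classified, which is what the one-line
change amounts to: EINPROGRESS (and EINTR) mean "still connecting", while
EAGAIN/EWOULDBLOCK is a hard failure for connect(2).

#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>

/*
 * Returns 1 if the connect is in progress (wait for writability),
 * 0 if it completed immediately, -1 on a real failure.
 */
static int
start_nonblocking_connect(int sock, const struct sockaddr *addr, socklen_t len)
{
	fcntl(sock, F_SETFL, fcntl(sock, F_GETFL, 0) | O_NONBLOCK);

	if (connect(sock, addr, len) == 0)
		return 0;				/* connected at once, e.g. a unix socket */

	if (errno == EINPROGRESS || errno == EINTR)
		return 1;				/* fine: poll the socket for writability */

	/*
	 * Anything else, including EAGAIN/EWOULDBLOCK (on Linux: no more free
	 * local ports), means the connection attempt has already failed.
	 */
	return -1;
}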

Could I convince a committer to NACK or commit & backpatch that patch?

It has come up before:
http://www.postgresql.org/message-id/CAMnJ+Beq0hCBuTY_=nz=ru0u-no543_raeunlvsayu8tugd...@mail.gmail.com
possibly also:
http://lists.pgfoundry.org/pipermail/pgpool-general/2007-March/000595.html

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash partitioning.

2013-06-26 Thread Heikki Linnakangas

On 26.06.2013 11:17, Yuri Levinsky wrote:

The main purpose of partitioning in my world
is to store billions of rows and be able to search by date, hour or even
minute as fast as possible.


Hash partitioning sounds like a bad fit for that use case. A regular 
b-tree, possibly with range partitioning, sounds optimal for that.



When you dealing with company, which has
~350.000.000 users, and you don't want to use key/value data stores: you
need hash partitioned tables and hash partitioned table clusters to
perform fast search and 4-6 tables join based on user phone number for
example.


B-trees are surprisingly fast for key-value lookups. There is no reason 
to believe that a hash partitioned table would be faster for that than a 
plain table.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Add more regression tests for dbcommands

2013-06-26 Thread Andres Freund
Hi,

I am generally a bit unsure whether the regression tests you propose
aren't a bit too verbose. Does any of the committers have an opinion
about this?

My feeling is that they are ok if they aren't slowing down things much.

On 2013-06-26 01:55:53 -0500, Robins Tharakan wrote:
 The CREATE DATABASE test itself was checking whether the 'CONNECTION LIMIT'
 was working. Removed that as well.

You should be able to test that by setting the connection limit to 1 and
then try to connect using \c. The old connection is only dropped after
the new one has been successfully performed.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal 9.4 plpgsql: allows access to call stack from GET DIAGNOSTICS statement

2013-06-26 Thread Rushabh Lathia
On Tue, Jun 25, 2013 at 11:09 AM, Pavel Stehule pavel.steh...@gmail.com wrote:

 2013/6/25 Rushabh Lathia rushabh.lat...@gmail.com:
 
 
 
  On Tue, Jun 25, 2013 at 1:47 AM, Pavel Stehule pavel.steh...@gmail.com
  wrote:
 
  Hello
 
  This is fragment ofmy old code from orafce package - it is functional,
  but it is written little bit more generic due impossible access to
  static variables (in elog.c)
 
  static char*
  dbms_utility_format_call_stack(char mode)
  {
 MemoryContext oldcontext = CurrentMemoryContext;
 ErrorData *edata;
 ErrorContextCallback *econtext;
 StringInfo sinfo;
 
 #if PG_VERSION_NUM >= 80400
errstart(ERROR, __FILE__, __LINE__, PG_FUNCNAME_MACRO,
 TEXTDOMAIN);
 #else
errstart(ERROR, __FILE__, __LINE__, PG_FUNCNAME_MACRO);
  #endif
 
 MemoryContextSwitchTo(oldcontext);
 
 for (econtext = error_context_stack;
   econtext != NULL;
   econtext = econtext->previous)
  (*econtext->callback) (econtext->arg);
 
 edata = CopyErrorData();
 FlushErrorState();
 
  https://github.com/orafce/orafce/blob/master/utility.c
 
  2013/6/24 Rushabh Lathia rushabh.lat...@gmail.com:
   Hi,
  
   Use of this feature is to get  call stack for debugging without
 raising
   exception. This definitely seems very useful.
  
   Here are my comments about the submitted patch:
  
   Patch gets applied cleanly (there are couple of white space warning
 but
   that's
   into test case file). make and make install too went smooth. make
 check
   was smooth too. Did some manual testing about feature and its went
   smooth.
  
   However would like to share couple of points:
  
 
  My main motive is concentration to stack info string only. So I didn't
  use err_start function, and I didn't use CopyErrorData too. These
  routines does lot of other work.
 
 
   1) I noticed that you did the initialization of edata manually, just
   wondering
   can't we directly use CopyErrorData() which will do that job ?
  
 
  yes, we can, but in this context on context field is interesting for
 us.
 
   I found that inside CopyErrorData() there is Assert:
  
   Assert(CurrentMemoryContext != ErrorContext);
  
   And in the patch we are setting using ErrorContext as short living
   context,
   is
   it the reason of not using CopyErrorData() ?
 
  it was not a primary reason, but sure - this routine is not designed
  for direct using from elog module. Next, we can save one palloc call.
 
 
  Hmm ok.
 
 
 
  
  
   2) To reset stack to empty, FlushErrorState() can be used.
  
 
  yes, it can be. I use explicit rows due different semantics, although
  it does same things. FlushErrorState some global ErrorState and it is
  used from other modules. I did a reset of ErrorContext. I fill a small
  difference there - so FlushErrorState is not best name for my purpose.
  What about modification:
 
  static void
  resetErrorStack()
  {
  errordata_stack_depth = -1;
  recursion_depth = 0;
  /* Delete all data in ErrorContext */
  MemoryContextResetAndDeleteChildren(ErrorContext);
  }
 
  and then call in  InvokeErrorCallbacks -- resetErrorStack()
 
  and rewrote flushErrorState like
 
  void FlushErrorState(void)
  {
/* reset ErrorStack is enough */
resetErrorStack();
  }
 
  ???
 
 
  Nope. rather then that I would still prefer direct call of
  FlushErrorState().
 
 
 
  but I can live well with direct call of FlushErrorState() - it is only
  minor change.
 
 
   3) I was think about the usability of the feature and wondering how
   about if
   we add some header to the stack, so user can differentiate between
 STACK
   and
   normal NOTICE message ?
  
   Something like:
  
   postgres=# select outer_outer_func(10);
   NOTICE:  - Call Stack -
   PL/pgSQL function inner_func(integer) line 4 at GET DIAGNOSTICS
   PL/pgSQL function outer_func(integer) line 3 at RETURN
   PL/pgSQL function outer_outer_func(integer) line 3 at RETURN
   CONTEXT:  PL/pgSQL function outer_func(integer) line 3 at RETURN
   PL/pgSQL function outer_outer_func(integer) line 3 at RETURN
outer_outer_func
   --
  20
   (1 row)
 
  I understand, but I don't think so it is good idea. Sometimes, you
  would to use context for different purposes than debug log - for
  example - you should to identify top most call or near call - and any
  additional lines means some little bit more difficult parsing. You can
  add this line simply (if you want)
 
  RAISE NOTICE e'--- Call Stack ---\n%', stack
 
  but there are more use cases, where you would not this information (or
  you can use own header (different language)), and better returns only
  clean data.
 
 
  I don't know but I still feel like given header from function it self
 would
  be
  nice to have things. Because it will help user to differentiate between
  STACK and normal NOTICE context message.
 
 
 
 
  
   I worked on point 2) and 3) and PFA patch for 

Re: [HACKERS] proposal 9.4 plpgsql: allows access to call stack from GET DIAGNOSTICS statement

2013-06-26 Thread Pavel Stehule
2013/6/26 Rushabh Lathia rushabh.lat...@gmail.com:



 On Tue, Jun 25, 2013 at 11:09 AM, Pavel Stehule pavel.steh...@gmail.com
 wrote:

 2013/6/25 Rushabh Lathia rushabh.lat...@gmail.com:
 
 
 
  On Tue, Jun 25, 2013 at 1:47 AM, Pavel Stehule pavel.steh...@gmail.com
  wrote:
 
  Hello
 
  This is fragment ofmy old code from orafce package - it is functional,
  but it is written little bit more generic due impossible access to
  static variables (in elog.c)
 
  static char*
  dbms_utility_format_call_stack(char mode)
  {
 MemoryContext oldcontext = CurrentMemoryContext;
 ErrorData *edata;
 ErrorContextCallback *econtext;
 StringInfo sinfo;
 
 #if PG_VERSION_NUM >= 80400
errstart(ERROR, __FILE__, __LINE__, PG_FUNCNAME_MACRO,
  TEXTDOMAIN);
 #else
errstart(ERROR, __FILE__, __LINE__, PG_FUNCNAME_MACRO);
  #endif
 
 MemoryContextSwitchTo(oldcontext);
 
 for (econtext = error_context_stack;
   econtext != NULL;
   econtext = econtext->previous)
  (*econtext->callback) (econtext->arg);
 
 edata = CopyErrorData();
 FlushErrorState();
 
  https://github.com/orafce/orafce/blob/master/utility.c
 
  2013/6/24 Rushabh Lathia rushabh.lat...@gmail.com:
   Hi,
  
   Use of this feature is to get  call stack for debugging without
   raising
   exception. This definitely seems very useful.
  
   Here are my comments about the submitted patch:
  
   Patch gets applied cleanly (there are couple of white space warning
   but
   that's
   into test case file). make and make install too went smooth. make
   check
   was smooth too. Did some manual testing about feature and its went
   smooth.
  
   However would like to share couple of points:
  
 
  My main motive is concentration to stack info string only. So I didn't
  use err_start function, and I didn't use CopyErrorData too. These
  routines does lot of other work.
 
 
   1) I noticed that you did the initialization of edata manually, just
   wondering
   can't we directly use CopyErrorData() which will do that job ?
  
 
  yes, we can, but in this context on context field is interesting for
  us.
 
   I found that inside CopyErrorData() there is Assert:
  
   Assert(CurrentMemoryContext != ErrorContext);
  
   And in the patch we are setting using ErrorContext as short living
   context,
   is
   it the reason of not using CopyErrorData() ?
 
  it was not a primary reason, but sure - this routine is not designed
  for direct using from elog module. Next, we can save one palloc call.
 
 
  Hmm ok.
 
 
 
  
  
   2) To reset stack to empty, FlushErrorState() can be used.
  
 
  yes, it can be. I use explicit rows due different semantics, although
  it does same things. FlushErrorState some global ErrorState and it is
  used from other modules. I did a reset of ErrorContext. I fill a small
  difference there - so FlushErrorState is not best name for my purpose.
  What about modification:
 
  static void
  resetErrorStack()
  {
  errordata_stack_depth = -1;
  recursion_depth = 0;
  /* Delete all data in ErrorContext */
  MemoryContextResetAndDeleteChildren(ErrorContext);
  }
 
  and then call in  InvokeErrorCallbacks -- resetErrorStack()
 
  and rewrote flushErrorState like
 
  void FlushErrorState(void)
  {
/* reset ErrorStack is enough */
resetErrorStack();
  }
 
  ???
 
 
  Nope. rather then that I would still prefer direct call of
  FlushErrorState().
 
 
 
  but I can live well with direct call of FlushErrorState() - it is only
  minor change.
 
 
   3) I was think about the usability of the feature and wondering how
   about if
   we add some header to the stack, so user can differentiate between
   STACK
   and
   normal NOTICE message ?
  
   Something like:
  
   postgres=# select outer_outer_func(10);
   NOTICE:  - Call Stack -
   PL/pgSQL function inner_func(integer) line 4 at GET DIAGNOSTICS
   PL/pgSQL function outer_func(integer) line 3 at RETURN
   PL/pgSQL function outer_outer_func(integer) line 3 at RETURN
   CONTEXT:  PL/pgSQL function outer_func(integer) line 3 at RETURN
   PL/pgSQL function outer_outer_func(integer) line 3 at RETURN
outer_outer_func
   --
  20
   (1 row)
 
  I understand, but I don't think so it is good idea. Sometimes, you
  would to use context for different purposes than debug log - for
  example - you should to identify top most call or near call - and any
  additional lines means some little bit more difficult parsing. You can
  add this line simply (if you want)
 
  RAISE NOTICE e'--- Call Stack ---\n%', stack
 
  but there are more use cases, where you would not this information (or
  you can use own header (different language)), and better returns only
  clean data.
 
 
  I don't know but I still feel like given header from function it self
  would
  be
  nice to have things. Because it will help user to differentiate between
  STACK and normal NOTICE context 

Re: [HACKERS] Review: Patch to compute Max LSN of Data Pages

2013-06-26 Thread Amit Kapila
On Wednesday, June 26, 2013 4:40 PM Andres Freund wrote:
 Hi Amit,
 
 On 2013-06-26 16:22:28 +0530, Amit Kapila wrote:
  On Wednesday, June 26, 2013 1:20 PM Andres Freund wrote:
   On 2013-06-26 08:50:27 +0530, Amit Kapila wrote:
On Tuesday, June 25, 2013 11:12 PM Andres Freund wrote:
 On 2013-06-16 17:19:49 -0700, Josh Berkus wrote:
  Amit posted a new version of this patch on January 23rd.  But
   last
  comment on it by Tom is not sure everyone wants this.
 
  https://commitfest.postgresql.org/action/patch_view?id=905

  ... so, what's the status of this patch?

 That comment was referencing a mail of mine - so perhaps I
 better
 explain:

 I think the usecase for this utility isn't big enough to be
   included in
 postgres since it really can only help in a very limited
 circumstances. And I think it's too likely to be misused for
 stuff
   it's
 not useable for (e.g. remastering).

 The only scenario I see is that somebody deleted/corrupted
 pg_controldata. In that scenario the tool is supposed to be
 used to
 find
 the biggest lsn used so far so the user then can use
 pg_resetxlog
   to
 set
 that as the wal starting point.
 But that can be way much easier solved by just setting the LSN
 to
 something very, very high. The database cannot be used for
 anything
 reliable afterwards anyway.
   
One of the main reason this was written was to make server up in
 case
   of
corruption and
user can take dump of some useful information if any.
   
By setting LSN very, very high user might loose the information
 which
   he
wants to take dump.
  
   Which information would that loose?
Information from WAL replay which can be more appropriate by
 selecting
  LSN.
 
 Sorry, I can't follow. If wal replay still is an option you can just
 look at the WAL and get a sensible value way easier. 

Originally 2 parts were proposed: one was to get the max LSN from data pages
and the other from WAL. The original proposal is:
http://www.postgresql.org/message-id/6C0B27F7206C9E4CA54AE035729E9C382851FFA1@szxeml509-mbs

The second part for looking into WAL was written but due to xlogreader
patch, it was postponed and I didn't get time after that 
to pursue it.


The whole tool
 seems to only make sense if you've lost pg_xlog.

The tool's initial intent was for the case where pg_controldata is lost; the
idea originated in the mail chain below:
http://www.postgresql.org/message-id/4274.1340084...@sss.pgh.pa.us


Also for a developer, guessing very high LSN might be easy, but for
 users
it might not be equally easy, and getting such value by utility
 would be
  comfortable.
 
 Well, then we can just document some very high lsn and be done with
 it. Like CF00/.
 That would leave enough space for eventual writes caused while dumping
 the database (say hint bit writes in a checksummed database) and cannot
 yet be realistically be reached during normal operation.

Can we be absolutely sure that this LSN is never reached? I think it will take
very long to reach such an LSN, but it is still theoretically possible.
I don't have any evidence.
 
One more use case for which this utility was done is as below:
It will be used to decide that on new-standby (old-master) whether
 a full
  backup is needed from
New-master(old-standby).
The backup is required when the data page in old-master precedes
the last applied LSN in old-standby (i.e., new-master) at the
 moment
of the failover.
 
 That's exactly what I was afraid of. Unless I miss something the tool
 is
 *NOT* sufficient to do this. 

You mean to say that even if the user knows the max LSN of data pages in the
old master and the last applied LSN in the new master, he cannot decide whether
a full backup is needed? It should be a straightforward decision: skip the
backup if that old-master LSN is less than the new-master one (i.e., the last
applied LSN, IOW, the timeline switch LSN).
It was proposed as a use case in the mail below:
http://www.postgresql.org/message-id/CAHGQGwHyd1fY0hF0qKh0-uKDh-gcbYxMOFBYVkkh4jzji-f...@mail.gmail.com
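
Stated as code, the decision rule I mean is roughly the sketch below (the names
are assumptions, not code from the patch; XLogRecPtr is a plain 64-bit value in
recent releases, so the comparison is well defined).

/*
 * Sketch only: decide whether the old master needs a fresh base backup
 * before it can follow the new master.  XLogRecPtr and bool come from the
 * PostgreSQL headers (xlogdefs.h, c.h).
 */
static bool
full_backup_required(XLogRecPtr max_data_page_lsn_old_master,
					 XLogRecPtr last_applied_lsn_new_master)
{
	/*
	 * Skip the backup if every data page on the old master is older than the
	 * point up to which the standby had replayed at failover; otherwise the
	 * old master has diverged and a full backup is needed.
	 */
	return max_data_page_lsn_old_master >= last_applied_lsn_new_master;
}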

 Look at the mail introducing pg_rewind and
 the ensuing discussion for what's necessary for that.

I had briefly looked into that discussion at the time it was going on, but I
will look into it more carefully.


With Regards,
Amit Kapila.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bloom Filter lookup for hash joins

2013-06-26 Thread Atri Sharma
 One idea that I had was to use them to do CLOG lookups from smaller
 datastructures. You could store the list of aborted xids in bloom
 filters. When a xid is not found in the filter, then it is known to
 have committed, if the filter returns true, then you have to check the
 real CLOG. This would achieve for example a 1:8 compression ratio at
 99% commit rate and 1:160'000 false positive rate, or 1:16 compression
 ratio at 1:400 false positive rate. Nothing too exciting, so I didn't
 really develop the idea any further.

Right.



 The problem here is that if the hash table is in memory, doing a hash
 table lookup directly is likely to be cheaper than a bloom filter
 lookup, even if the bloom filter fits into the processor cache and the
 hash table doesn't (10 last level cache hits is slower than one cache
 miss). Bloom filter will help when its not feasible to use an actual
 hash table (lots of large keys), the real lookup is slow (e.g. an
 index lookup), you are doing a lot of lookups to amortize the
 construction cost and the condition is expected to be selective (most
 lookups give a negative). There might be some dataware house types of
 queries where it would help, but it seems like an awfully narrow use
 case with a potential for performance regressions when the planner has
 a bad estimate.

Ok, sounds good. Can't we use bloom filters for the case where the hash
table doesn't fit in memory? Specifically, when reading from disk is
inevitable for accessing the hash table, we can use bloom filters for
deciding which keys to actually read from disk. Thus we save the I/O costs
we would otherwise incur reading keys that are not the ones we want, where
the bloom filter gives us a negative.

Regards,

Atri


--
Regards,

Atri
l'apprenant


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Move unused buffers to freelist

2013-06-26 Thread Amit Kapila
On Tuesday, June 25, 2013 10:25 AM Amit Kapila wrote:
 On Monday, June 24, 2013 11:00 PM Robert Haas wrote:
  On Thu, Jun 6, 2013 at 3:01 AM, Amit Kapila amit.kap...@huawei.com
  wrote:
   To avoid above 3 factors in test readings, I used below steps:
   1. Initialize the database with scale factor such that database
 size
  +
   shared_buffers = RAM (shared_buffers = 1/4 of RAM).
  For example:
  Example -1
   if RAM = 128G, then initialize db with scale factor
 =
  6700
   and shared_buffers = 32GB.
   Database size (98 GB) + shared_buffers (32GB) = 130
  (which
   is approximately equal to total RAM)
  Example -2 (this is based on your test m/c)
   If RAM = 64GB, then initialize db with scale factor
 =
  3400
   and shared_buffers = 16GB.
   2. reboot m/c
   3. Load all buffers with data (tables/indexes of pgbench) using
  pg_prewarm.
   I had loaded 3 times, so that usage count of buffers will be
  approximately
   3.
 
  Hmm.  I don't think the usage count will actually end up being 3,
  though, because the amount of data you're loading is sized to 3/4 of
  RAM, and shared_buffers is just 1/4 of RAM, so I think that each run
  of pg_prewarm will end up turning over the entire cache and you'll
  never get any usage counts more than 1 this way.  Am I confused?
 
 The way I am pre-warming is that loading the data of relation
 (table/index)
 continuously 3 times, so mostly the buffers will contain the data of
 relations loaded in last
 which are indexes and also they got accessed more during scans. So
 usage
 count should be 3.
 Can you please once see load_all_buffers.sql, may be my understanding
 has
 some gap.
 
 Now about the question why then load all the relations.
 Apart from PostgreSQL shared buffers, loading data this way can also
 make sure OS buffers will have the data with higher usage count which
 can
 lead to better OS scheduling.
 
  I wonder if it would be beneficial to test the case where the
 database
  size is just a little more than shared_buffers.  I think that would
  lead to a situation where the usage counts are high most of the time,
  which - now that you mention it - seems like the sweet spot for this
  patch.
 
 I will check this case and take the readings for same. Thanks for your
 suggestions.

Configuration Details
O/S - Suse-11
RAM - 128GB
Number of Cores - 16
Server Conf - checkpoint_segments = 300; checkpoint_timeout = 15 min;
synchronous_commit = off; shared_buffers = 14GB; autovacuum = off
Pgbench - Select-only
Scale factor - 1200
Time - 30 mins

           8C-8T    16C-16T    32C-32T    64C-64T
Head       62403    101810     99516      94707
Patch      62827    101404     99109      94744

On 128GB RAM, if we use scale factor 1200 (database approx. 17GB) and 14GB
shared buffers, there is no major difference.
One of the reasons could be that there is not much swapping in shared buffers,
as most data already fits in shared buffers.


I think more readings are needed for combinations related to the settings
below: scale factor such that database size + shared_buffers = RAM
(shared_buffers = 1/4 of RAM).

I can try varying the shared_buffers size.

Kindly let me know your suggestions.

With Regards,
Amit Kapila.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bloom Filter lookup for hash joins

2013-06-26 Thread Simon Riggs
On 26 June 2013 07:46, Atri Sharma atri.j...@gmail.com wrote:


 I have been researching bloom filters and discussed it on IRC with
 RhodiumToad and David Fetter, and they pointed me to the various
 places that could potentially have bloom filters, apart from the
 places that already have them currently.

 I have been reading the current implementation of hash joins, and in
 ExecScanHashBucket, which I understand is the actual lookup function,
 we could potentially look at a bloom filter per bucket. Instead of
 actually looking up each hash value for the outer relation, we could
 just check the corresponding bloom filter for that bucket, and if we
 get a positive, then lookup the actual values i.e. continue with our
 current behaviour (since we could be looking at a false positive).


Exactly this was suggested by me on the NTUP_PER_BUCKET thread last week.

Probably good idea to join in there also.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Bloom Filter lookup for hash joins

2013-06-26 Thread Atri Sharma
On Wed, Jun 26, 2013 at 5:47 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On 26 June 2013 07:46, Atri Sharma atri.j...@gmail.com wrote:


 I have been researching bloom filters and discussed it on IRC with
 RhodiumToad and David Fetter, and they pointed me to the various
 places that could potentially have bloom filters, apart from the
 places that already have them currently.

 I have been reading the current implementation of hash joins, and in
 ExecScanHashBucket, which I understand is the actual lookup function,
 we could potentially look at a bloom filter per bucket. Instead of
 actually looking up each hash value for the outer relation, we could
 just check the corresponding bloom filter for that bucket, and if we
 get a positive, then lookup the actual values i.e. continue with our
 current behaviour (since we could be looking at a false positive).


 Exactly this was suggested by me on the NTUP_PER_BUCKET thread last week.

 Probably good idea to join in there also.


Great. I would love to assist you in implementing this, if possible.

Regards,

Atri


--
Regards,

Atri
l'apprenant


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal 9.4 plpgsql: allows access to call stack from GET DIAGNOSTICS statement

2013-06-26 Thread Rushabh Lathia
On Wed, Jun 26, 2013 at 5:11 PM, Pavel Stehule pavel.steh...@gmail.com wrote:

 2013/6/26 Rushabh Lathia rushabh.lat...@gmail.com:
 
 
 
  On Tue, Jun 25, 2013 at 11:09 AM, Pavel Stehule pavel.steh...@gmail.com
 
  wrote:
 
  2013/6/25 Rushabh Lathia rushabh.lat...@gmail.com:
  
  
  
   On Tue, Jun 25, 2013 at 1:47 AM, Pavel Stehule 
 pavel.steh...@gmail.com
   wrote:
  
   Hello
  
   This is fragment ofmy old code from orafce package - it is
 functional,
   but it is written little bit more generic due impossible access to
   static variables (in elog.c)
  
   static char*
   dbms_utility_format_call_stack(char mode)
   {
  MemoryContext oldcontext = CurrentMemoryContext;
  ErrorData *edata;
  ErrorContextCallback *econtext;
  StringInfo sinfo;
  
  #if PG_VERSION_NUM >= 80400
 errstart(ERROR, __FILE__, __LINE__, PG_FUNCNAME_MACRO,
   TEXTDOMAIN);
  #else
 errstart(ERROR, __FILE__, __LINE__, PG_FUNCNAME_MACRO);
   #endif
  
  MemoryContextSwitchTo(oldcontext);
  
  for (econtext = error_context_stack;
econtext != NULL;
econtext = econtext->previous)
   (*econtext->callback) (econtext->arg);
  
  edata = CopyErrorData();
  FlushErrorState();
  
   https://github.com/orafce/orafce/blob/master/utility.c
  
   2013/6/24 Rushabh Lathia rushabh.lat...@gmail.com:
Hi,
   
Use of this feature is to get  call stack for debugging without
raising
exception. This definitely seems very useful.
   
Here are my comments about the submitted patch:
   
Patch gets applied cleanly (there are couple of white space warning
but
that's
into test case file). make and make install too went smooth. make
check
was smooth too. Did some manual testing about feature and its went
smooth.
   
However would like to share couple of points:
   
  
   My main motive is concentration to stack info string only. So I
 didn't
   use err_start function, and I didn't use CopyErrorData too. These
   routines does lot of other work.
  
  
1) I noticed that you did the initialization of edata manually,
 just
wondering
can't we directly use CopyErrorData() which will do that job ?
   
  
   yes, we can, but in this context on context field is interesting
 for
   us.
  
I found that inside CopyErrorData() there is Assert:
   
Assert(CurrentMemoryContext != ErrorContext);
   
And in the patch we are setting using ErrorContext as short living
context,
is
it the reason of not using CopyErrorData() ?
  
   it was not a primary reason, but sure - this routine is not designed
   for direct using from elog module. Next, we can save one palloc call.
  
  
   Hmm ok.
  
  
  
   
   
2) To reset stack to empty, FlushErrorState() can be used.
   
  
   yes, it can be. I use explicit rows due different semantics, although
   it does same things. FlushErrorState some global ErrorState and it is
   used from other modules. I did a reset of ErrorContext. I fill a
 small
   difference there - so FlushErrorState is not best name for my
 purpose.
   What about modification:
  
   static void
   resetErrorStack()
   {
   errordata_stack_depth = -1;
   recursion_depth = 0;
   /* Delete all data in ErrorContext */
   MemoryContextResetAndDeleteChildren(ErrorContext);
   }
  
   and then call in  InvokeErrorCallbacks -- resetErrorStack()
  
   and rewrote flushErrorState like
  
   void FlushErrorState(void)
   {
 /* reset ErrorStack is enough */
 resetErrorStack();
   }
  
   ???
  
  
   Nope. rather then that I would still prefer direct call of
   FlushErrorState().
  
  
  
   but I can live well with direct call of FlushErrorState() - it is
 only
   minor change.
  
  
3) I was think about the usability of the feature and wondering how
about if
we add some header to the stack, so user can differentiate between
STACK
and
normal NOTICE message ?
   
Something like:
   
postgres=# select outer_outer_func(10);
NOTICE:  - Call Stack -
PL/pgSQL function inner_func(integer) line 4 at GET DIAGNOSTICS
PL/pgSQL function outer_func(integer) line 3 at RETURN
PL/pgSQL function outer_outer_func(integer) line 3 at RETURN
CONTEXT:  PL/pgSQL function outer_func(integer) line 3 at RETURN
PL/pgSQL function outer_outer_func(integer) line 3 at RETURN
 outer_outer_func
--
   20
(1 row)
  
   I understand, but I don't think so it is good idea. Sometimes, you
   would to use context for different purposes than debug log - for
   example - you should to identify top most call or near call - and any
   additional lines means some little bit more difficult parsing. You
 can
   add this line simply (if you want)
  
   RAISE NOTICE e'--- Call Stack ---\n%', stack
  
   but there are more use cases, where you would not this information
 (or
   you can use own header 

Re: [HACKERS] A better way than tweaking NTUP_PER_BUCKET

2013-06-26 Thread Atri Sharma
On Sun, Jun 23, 2013 at 8:41 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 On 23.06.2013 01:48, Simon Riggs wrote:

 On 22 June 2013 21:40, Stephen Frost sfr...@snowman.net wrote:

 I'm actually not a huge fan of this as it's certainly not cheap to do. If
 it
 can be shown to be better than an improved heuristic then perhaps it
 would
 work but I'm not convinced.


 We need two heuristics, it would seem:

 * an initial heuristic to overestimate the number of buckets when we
 have sufficient memory to do so

 * a heuristic to determine whether it is cheaper to rebuild a dense
 hash table into a better one.

 Although I like Heikki's rebuild approach we can't do this every x2
 overstretch. Given large underestimates exist we'll end up rehashing
 5-12 times, which seems bad.


 It's not very expensive. The hash values of all tuples have already been
 calculated, so rebuilding just means moving the tuples to the right bins.


 Better to let the hash table build and
 then re-hash once, it we can see it will be useful.


 That sounds even less expensive, though.

Hi all,

I just popped in here on Simon's advice to put an idea I had about
optimizing hash joins on this thread.

Essentially, I was thinking of using bloom filters in the part where
we match the actual hash values of the outer relation's tuple and the
hash table. We could do a bloom filter lookup first, and if it gives a
positive, then we can proceed like we do currently, since we could
have a false positive. However, if we have a negative from the bloom
filter, then, we can skip that tuple since bloom filters never give
false negatives.

Another thing we could potentially look at is the case where
the hash table doesn't fit into memory and we have to do disk lookups;
then we could do bloom filter lookups first in order to select the
tuples to be read from disk (only the positive ones).

Thoughts/Comments please?

Regards,

Atri



--
Regards,

Atri
l'apprenant


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash partitioning.

2013-06-26 Thread Bruce Momjian
On Tue, Jun 25, 2013 at 02:52:33PM -0700, Kevin Grittner wrote:
 Claudio Freire klaussfre...@gmail.com wrote:
 
  Did you try select * from foo where (a % 16) = (1::int % 16)?
 
 I did.  Using Robert's hashed partitioning table definitions:
 
 test=# explain select * from foo where a = 1 and (a % 16) = (1 % 16);
                          QUERY PLAN
 ------------------------------------------------------------
  Append  (cost=0.00..31.53 rows=2 width=36)
    ->  Seq Scan on foo  (cost=0.00..0.00 rows=1 width=36)
          Filter: ((a = 1) AND ((a % 16) = 1))
    ->  Seq Scan on foo1  (cost=0.00..31.53 rows=1 width=36)
          Filter: ((a = 1) AND ((a % 16) = 1))
 (5 rows)
 
 So if you are generating your queries through something capable of
 generating that last clause off of the first, this could work.  Not

OK, so what is it in our code that requires that?  Is it a type
mismatch?

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Kudos for Reviewers -- straw poll

2013-06-26 Thread Bruce Momjian
On Wed, Jun 26, 2013 at 10:40:17AM +1000, Brendan Jurd wrote:
 On 26 June 2013 03:17, Josh Berkus j...@agliodbs.com wrote:
  How should reviewers get credited in the release notes?
 
  a) not at all
  b) in a single block titled Reviewers for this version at the bottom.
  c) on the patch they reviewed, for each patch
 
 A weak preference for (c), with (b) running a close second.  As others
 have suggested, a review that leads to significant commitable changes
 to the patch should bump the credit to co-authorship.

As a reminder, I tried a variant of C for 9.2 beta release notes, and
got lots of complaints, particularly because the line describing the
feature now had many more names on it.

In my opinion, adding reviewer names to each feature item might result
in the removal of all names from features.

A poll is nice for gauging interest, but many people who vote don't
understand the ramifications of what they are voting on.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Kudos for Reviewers -- straw poll

2013-06-26 Thread Andrew Dunstan


On 06/26/2013 09:14 AM, Bruce Momjian wrote:

On Wed, Jun 26, 2013 at 10:40:17AM +1000, Brendan Jurd wrote:

On 26 June 2013 03:17, Josh Berkus j...@agliodbs.com wrote:

How should reviewers get credited in the release notes?

a) not at all
b) in a single block titled Reviewers for this version at the bottom.
c) on the patch they reviewed, for each patch

A weak preference for (c), with (b) running a close second.  As others
have suggested, a review that leads to significant commitable changes
to the patch should bump the credit to co-authorship.

As a reminder, I tried a variant of C for 9.2 beta release notes, and
got lots of complaints, particularly because the line describing the
feature now had many more names on it.

In my opinion, adding reviewer names to each feature item might result
in the removal of all names from features.

A poll is nice for gauging interest, but many people who vote don't
understand the ramifications of what they are voting on.




That's why I voted for b :-)

cheers

andrew


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Optimizing pglz compressor

2013-06-26 Thread Amit Kapila
On Wednesday, June 26, 2013 2:15 AM Heikki Linnakangas wrote:
 On 19.06.2013 14:01, Amit Kapila wrote:
  Observations
  --
  1. For small data perforamce is always good with patch.
  2. For random small/large data performace is good.
  3. For medium and large text and same byte data(3K,5K text,
  10K,100K,500K same byte), performance is degraded.
 
 Wow, that's strange. What platform and CPU did you test on? 

CPU - 4 core 
RAM - 24GB 
OS  - SUSE 11 SP2
Kernel version - 3.0.13

 Are you
 sure you used the same compiler flags with and without the patch?

Yes.
 
 Can you also try the attached patch, please? It's the same as before,
 but in this version, I didn't replace the prev and next pointers in
 PGLZ_HistEntry struct with int16s. That avoids some table lookups, at
 the expense of using more memory. It's closer to what we have without
 the patch, so maybe that helps on your system.

Yes it helped a lot on my system.
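
For reference, the difference being measured is roughly between the two layouts
sketched below; the field names are assumptions based on the description above,
not the exact patches.

#include <stdint.h>

/*
 * Layout in this second patch (close to the unpatched code): direct pointers,
 * so following a chain needs no extra array indexing, at the cost of larger
 * entries.
 */
typedef struct PGLZ_HistEntryPtr
{
	struct PGLZ_HistEntryPtr *next;		/* next entry with the same hash */
	struct PGLZ_HistEntryPtr *prev;
	const char *pos;					/* position in the input buffer */
} PGLZ_HistEntryPtr;

/*
 * Layout in the earlier patch version: int16 indexes into the entry array.
 * Entries are much smaller (better cache footprint), but every hop costs an
 * extra hist_entries[id] lookup.
 */
typedef struct PGLZ_HistEntryIdx
{
	int16_t		next_id;				/* index of next entry, or -1 */
	int16_t		prev_id;
	const char *pos;
} PGLZ_HistEntryIdx;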

Head: 
 testname  |   auto 
---+--- 
 5k text   |  3499.888 
 512b text |  1425.106 
 256b text |  1769.126 
 1K text   |  1378.151 
 3K text   |  4081.254 
 2k random |  -410.928 
 100k random   |   -10.235 
 500k random   |-2.094 
 512b random   |  -770.665 
 256b random   | -1120.173 
 1K random |  -570.351 
 10k of same byte  |  3602.610 
 100k of same byte | 36187.863 
 500k of same byte | 26055.472

After your Patch (pglz-variable-size-hash-table-2.patch)

 testname  |   auto 
---+--- 
 5k text   |  3524.306 
 512b text |   954.962 
 256b text |   832.527 
 1K text   |  1273.970 
 3K text   |  3963.329 
 2k random |  -300.516 
 100k random   |-7.538 
 500k random   |-1.525 
 512b random   |  -439.726 
 256b random   |  -440.154 
 1K random |  -391.070 
 10k of same byte  |  3570.921 
 100k of same byte | 37498.502 
 500k of same byte | 26904.426

There was a minor problem in your patch: in one of the experiments it crashed.
The fix is to not access the 0th history entry in pglz_find_match();
a modified patch is attached.

After the fix, the readings are almost the same:

testname   |   auto 
---+--- 
 5k text   |  3347.961 
 512b text |   938.442 
 256b text |   834.496 
 1K text   |  1243.618 
 3K text   |  3790.835 
 2k random |  -306.470 
 100k random   |-7.589 
 500k random   |-1.517 
 512b random   |  -442.649 
 256b random   |  -438.781 
 1K random |  -392.106 
 10k of same byte  |  3565.449 
 100k of same byte | 37355.595 
 500k of same byte | 26776.076



I guess some difference might be due to the different way of accessing the
history entries in pglz_hist_add().


With Regards,
Amit Kapila.


pglz-variable-size-hash-table-3.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash partitioning.

2013-06-26 Thread Yuri Levinsky
Heikki,
As far as I understand, the height of the btree will affect the number of
I/Os necessary. The height of the tree does not increase linearly with
the number of records. Maybe I am wrong in terminology, but when I try to
insert data into an empty table, the insertion time increases as the number
of records grows. In order to keep indexes as small as possible I usually
split the table by hash if I don't have any better alternative. On some
systems a hash function plus an index might work faster than the index alone
for insert and search operations. This is especially usable when you have a
non-unique index with a small number of possible values that you don't know
in advance or that change between your customers. In that case hash
partitioning has to be used instead of an index.
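
To illustrate how slowly the height actually grows, here is a back-of-the-envelope
estimate; the fanout of roughly 400 keys per page is an assumed typical figure
for a small key on 8K pages, not a measurement.

#include <math.h>
#include <stdio.h>

int
main(void)
{
	/* rough b-tree height estimate: ceil(log_fanout(N)) levels */
	double		fanout = 400.0;
	long long	n[] = {1000000LL, 1000000000LL};

	for (int i = 0; i < 2; i++)
		printf("N = %lld  ->  about %d levels\n",
			   n[i], (int) ceil(log((double) n[i]) / log(fanout)));
	return 0;
}

Going from a million to a billion rows adds only about one level, i.e. roughly
one extra page read per lookup, which is why the growth is far from linear.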

Sincerely yours,


Yuri Levinsky, DBA
Celltick Technologies Ltd., 32 Maskit St., Herzliya 46733, Israel
Mobile: +972 54 6107703, Office: +972 9 9710239; Fax: +972 9 9710222


-Original Message-
From: Heikki Linnakangas [mailto:hlinnakan...@vmware.com] 
Sent: Wednesday, June 26, 2013 2:23 PM
To: Yuri Levinsky
Cc: Tom Lane; Christopher Browne; Robert Haas; Bruce Momjian; PostgreSQL
Mailing Lists
Subject: Re: [HACKERS] Hash partitioning.

On 26.06.2013 11:17, Yuri Levinsky wrote:
 The main purpose of partitioning in my world is to store billions of 
 rows and be able to search by date, hour or even minute as fast as 
 possible.

Hash partitioning sounds like a bad fit for that use case. A regular
b-tree, possibly with range partitioning, sounds optimal for that.

 When you dealing with company, which has
 ~350.000.000 users, and you don't want to use key/value data stores: 
 you need hash partitioned tables and hash partitioned table clusters 
 to perform fast search and 4-6 tables join based on user phone number 
 for example.

B-trees are surprisingly fast for key-value lookups. There is no reason
to believe that a hash partitioned table would be faster for that than a
plain table.

- Heikki

This mail was received via Mail-SeCure System.




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] updated emacs configuration

2013-06-26 Thread Noah Misch
On Tue, Jun 25, 2013 at 11:17:47PM -0400, Peter Eisentraut wrote:
 On Sun, 2013-06-23 at 16:03 -0400, Noah Misch wrote:
   ((c-mode . ((c-basic-offset . 4)
   (fill-column . 79)
  
  I don't know whether you'd consider it to fall within the scope of this
  update, but 78 is the fill-column setting that matches pgindent.
 
 Well, well, well.  I did some extensive tests some time ago when that
 setting was added.  I have a suspicion that this could be related to the
 recent pgindent changes (which everyone claims were no changes).  I'm
 going to have to research that some more.

pgindent has not changed the following xlog.c comment since April 2011, but
fill-column 77 or 79 changes it:

/*
 * fullPageWrites is the master copy used by all backends to determine
 * whether to write full-page to WAL, instead of using process-local 
one.
 * This is required because, when full_page_writes is changed by SIGHUP,
 * we must WAL-log it before it actually affects WAL-logging by 
backends.
 * Checkpointer sets at startup or after SIGHUP.
 */

Note that emacs and pgindent remain at odds over interior tabs in comments.
When pgindent finds a double-space (typically after a sentence) ending at a
tab stop, it replaces the double-space with a tab.  c-fill-paragraph will
convert that tab to a *single* space, and that can be enough to change many
line break positions.

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] A better way than tweaking NTUP_PER_BUCKET

2013-06-26 Thread Stephen Frost
Atri,

* Atri Sharma (atri.j...@gmail.com) wrote:
 I just popped in here on Simon's advice to put an idea I had about
 optimizing hash joins on this thread.

I'd encourage reading the thread a bit first, in the future.. :)

 Essentially, I was thinking of using bloom filters in the part where
 we match the actual hash values of the outer relation's tuple and the
 hash table. We could do a bloom filter lookup first, and if it gives a
 positive, then we can proceed like we do currently, since we could
 have a false positive. However, if we have a negative from the bloom
 filter, then, we can skip that tuple since bloom filters never give
 false negatives.

I suggested this up-thread already, but it's not really a bloom filter
as there's only one hash function available - I can't see us requiring
every data type to provide multiple hash functions.  Still, I do think
breaking the single 32-bit hash key space up into fixed-sized chunks and
then having a bitfield array which we test against (very similar to how
the visibility map works) to see if there's any chance that a given hash
key exists might be valuable.  The problem is that, because we don't
have multiple hash functions, it's not clear how much empty space we'd
actually end up with.

This needs to be tested with different data sets and different sizes for
the bitfield (leading to correspondingly different sizes for the amount
of key space covered  by a single bit), along with performance metrics
which determine how expensive it is to build and use the bitfield.
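To make that concrete, here is a minimal sketch of the kind of per-join bitfield I have in mind. The 8 kB size, the chunking of the 32-bit key space, and the HashFilter name are assumptions for illustration only; nothing like this exists in the executor today:

#define FILTER_NBITS	65536		/* 2^16 bits -> an 8 kB bitmap */
#define FILTER_SHIFT	16			/* 2^32 keys / 2^16 bits = 2^16 keys per bit */

typedef struct HashFilter
{
	unsigned char	bits[FILTER_NBITS / 8];
} HashFilter;

/* mark the chunk of key space this hash value falls into */
static inline void
filter_set(HashFilter *f, unsigned int hashvalue)
{
	unsigned int	bit = hashvalue >> FILTER_SHIFT;

	f->bits[bit / 8] |= (unsigned char) (1 << (bit % 8));
}

/* returns 0 only if no key from that chunk was ever inserted */
static inline int
filter_might_contain(const HashFilter *f, unsigned int hashvalue)
{
	unsigned int	bit = hashvalue >> FILTER_SHIFT;

	return (f->bits[bit / 8] >> (bit % 8)) & 1;
}

Building it is one pass over the inner relation setting bits, and probing is a shift, a load and a mask, so the open question really is how many bits stay zero for realistic data.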

 Another thing we could potentially look at is that in the case when
 the hash table doesnt fit into memory, and we have to do disk lookups,
 then, we could do bloom filter lookups first in order to select the
 tuples to be read from disk(only the positive ones).

We could have a bitfield filter (as I described above) created for each
bucket and then test against that before considering if we actually have
to go look in that bucket, yes.  I'm not sure if that's quite what you
were thinking, but I can see how a bitfield per bucket might work.  If
you were suggesting something else, please clarify.

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Hash partitioning.

2013-06-26 Thread Markus Wanner
On 06/25/2013 11:52 PM, Kevin Grittner wrote:
 At least until we have parallel
 query execution.  At *that* point this all changes.

Can you elaborate on that, please? I currently have a hard time
imagining how partitions can help performance in that case, either. At
least compared to modern RAID and read-ahead capabilities.

After all, RAID can be thought of as hash partitioning with a very weird
hash function. Or maybe rather range partitioning on an internal key.

Put another way: ideally, the system should take care of optimally
distributing data across its physical storage itself. If you need to do
partitioning manually for performance reasons, that's actually a
deficiency of it, not a feature.

I certainly agree that manageability may be a perfectly valid reason to
partition your data. Maybe there even exist other good reasons. I don't
think performance optimization is one. (It's more like giving the system
a hint. And we all dislike hints, don't we? *ducks*)

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash partitioning.

2013-06-26 Thread k...@rice.edu
On Wed, Jun 26, 2013 at 03:47:43PM +0200, Markus Wanner wrote:
 On 06/25/2013 11:52 PM, Kevin Grittner wrote:
  At least until we have parallel
  query execution.  At *that* point this all changes.
 
 Can you elaborate on that, please? I currently have a hard time
 imagining how partitions can help performance in that case, either. At
 least compared to modern RAID and read-ahead capabilities.
 
 After all, RAID can be thought of as hash partitioning with a very weird
 hash function. Or maybe rather range partitioning on an internal key.
 
 Put another way: ideally, the system should take care of optimally
 distributing data across its physical storage itself. If you need to do
 partitioning manually for performance reasons, that's actually a
 deficiency of it, not a feature.
 
 I certainly agree that manageability may be a perfectly valid reason to
 partition your data. Maybe there even exist other good reasons. I don't
 think performance optimization is one. (It's more like giving the system
 a hint. And we all dislike hints, don't we? *ducks*)
 
 Regards
 
 Markus Wanner
 

Hi Markus,

I think he is referring to the fact that with parallel query execution,
multiple partitions can be processed simultaneously instead of serially
as they are now with the resulting speed increase.

Regards,
Ken


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] patch submission: truncate trailing nulls from heap rows to reduce the size of the null bitmap [Review]

2013-06-26 Thread Jamie Martin
FYI, since Amit's measurements I submitted a slightly modified patch that is
slightly faster.

On Jun 25, 2013, at 1:26 PM, Amit Kapila amit.kap...@huawei.com wrote:

 On Monday, June 24, 2013 8:20 PM Tom Lane wrote:
 Amit Kapila amit.kap...@huawei.com writes:
 I will summarize the results, and if most of us feel that they are
 not good
 enough, then we can return this patch.
 
 Aside from the question of whether there's really any generally-useful
 performance improvement from this patch, it strikes me that this change
 forecloses other games that we might want to play with interpretation
 of
 the value of a tuple's natts.
 
 In particular, when I was visiting Salesforce last week, the point came
 up that they'd really like ALTER TABLE ADD COLUMN to be free even
 when
 the column had a non-null DEFAULT.  It's not too difficult to imagine
 how we might support that, at least when the default expression is a
 constant: decree that if the requested attribute number is beyond
 natts,
 look aside at the tuple descriptor to see if the column is marked as
 having a magic default value, and if so, substitute that rather than
 just returning NULL.  (This has to be a magic value separate from
 the current default, else a subsequent DROP or SET DEFAULT would do
 the wrong thing.)
 
 However, this idea conflicts with an optimization that supposes it can
 reduce natts to suppress null columns: if the column was actually
 stored
 as null, you'd lose that information, and would incorrectly return the
 magic default on subsequent reads.
 
 I think it might be possible to make both ideas play together, by
 not reducing natts further than the last column with a magic default.
 However, that means extra complexity in heap_form_tuple, which would
 invalidate the performance measurements done in support of this patch.
 
  It can have slight impact on normal scenario's, but I am not sure how much
 because
  the change will be very less(may be one extra if check and one assignment)
 
  For this Patch's scenario, I think the major benefit for Performance is in
 heap_fill_tuple() where the 
  For loop is reduced. However added some logic in heap_form_tuple can
 reduce the performance improvement, 
  but there can still be space saving benefit. 
 
 With Regards,
 Amit Kapila.
 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash partitioning.

2013-06-26 Thread Heikki Linnakangas

On 26.06.2013 16:41, Yuri Levinsky wrote:

Heikki,
As far as I understand the height of the btree will affect the number of
I/Os necessary. The height of the tree does not increase linearly with
the number of records.


The height of a b-tree is O(log n), where n is the number of records. 
Informally, if we assume that you have on average, say, 1000 keys on one 
b-tree page, a two-level b-tree can hold one million items, and a three 
level one billion items, and so on. The height of the tree affects the 
number of I/Os needed for searches, but keep in mind that the top levels 
of the tree are going to be very frequently accessed and in practice 
will stay permanently cached. You will only perform actual I/O on the 
1-2 bottom levels of the tree (or 0 if it all fits in cache)
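Just to put rough numbers on that, here is a back-of-the-envelope calculation; the fan-out of 1000 keys per page is only an assumption for the example:

#include <math.h>
#include <stdio.h>

int
main(void)
{
	const double	fanout = 1000.0;	/* assumed keys per b-tree page */
	const double	rows[] = {1e6, 350e6, 1e9};

	for (int i = 0; i < 3; i++)
		printf("%12.0f rows -> about %d levels\n",
			   rows[i], (int) ceil(log(rows[i]) / log(fanout)));
	return 0;
}

Even the ~350 million row case comes out at about three levels, and the top levels are the ones that stay cached.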


Now let's compare that with a hash partitioned table, with 1000 
partitions and a b-tree index on every partition. When doing a search, 
you first hash the key to look up the right partition, then you search 
the index of that partition. This is almost equivalent to just having a 
b-tree that's one level taller - instead of looking up the right 
partition in the hash table, you look up the right child page at the 
root of the b-tree. From a very coarse theoretical point of view, the 
only difference is that you replaced the binary search on the b-tree 
root page with an equivalent hash lookup. A hash lookup can be somewhat 
cheaper than binary search, but in practice there is very little 
difference. There certainly isn't any difference in the number of actual 
I/O performed.


In practice, there might be a lot of quirks and inefficiencies and 
locking contention etc. involved in various DBMS's, that you might be 
able to work around with hash partitioning. But from a theoretical point 
of view, there is no reason to expect just partitioning a table on a 
hash to make key-value lookups any faster.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash partitioning.

2013-06-26 Thread Yuri Levinsky
Markus,
There is no relation between partitions and RAID, even though they both
distribute data somehow. At the end of the day, when you use RAID you
have one single device with some performance limitations. When you want
to improve your data access beyond that, and you do not want to work with huge
indexes that you are unable to maintain, or you do not want to use an index at
all (as in the case of range partitioning by time, or hash partitioning), you are
welcome to use partitions. You typically don't want to use a b-tree index when
you select more than ~1-2% of your data.

Sincerely yours,


Yuri Levinsky, DBA
Celltick Technologies Ltd., 32 Maskit St., Herzliya 46733, Israel
Mobile: +972 54 6107703, Office: +972 9 9710239; Fax: +972 9 9710222


-Original Message-
From: k...@rice.edu [mailto:k...@rice.edu] 
Sent: Wednesday, June 26, 2013 5:01 PM
To: Markus Wanner
Cc: Kevin Grittner; Claudio Freire; Robert Haas; Bruce Momjian; Yuri
Levinsky; PostgreSQL-Dev
Subject: Re: [HACKERS] Hash partitioning.

On Wed, Jun 26, 2013 at 03:47:43PM +0200, Markus Wanner wrote:
 On 06/25/2013 11:52 PM, Kevin Grittner wrote:
  At least until we have parallel
  query execution.  At *that* point this all changes.
 
 Can you elaborate on that, please? I currently have a hard time 
 imagining how partitions can help performance in that case, either. At

 least compared to modern RAID and read-ahead capabilities.
 
 After all, RAID can be thought of as hash partitioning with a very 
 weird hash function. Or maybe rather range partitioning on an internal
key.
 
 Put another way: ideally, the system should take care of optimally 
 distributing data across its physical storage itself. If you need to 
 do partitioning manually for performance reasons, that's actually a 
 deficiency of it, not a feature.
 
 I certainly agree that manageability may be a perfectly valid reason 
 to partition your data. Maybe there even exist other good reasons. I 
 don't think performance optimization is one. (It's more like giving 
 the system a hint. And we all dislike hints, don't we? *ducks*)
 
 Regards
 
 Markus Wanner
 

Hi Markus,

I think he is referring to the fact that with parallel query execution,
multiple partitions can be processed simultaneously instead of serially
as they are now with the resulting speed increase.

Regards,
Ken





-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash partitioning.

2013-06-26 Thread Bruce Momjian
On Wed, Jun 26, 2013 at 05:10:00PM +0300, Heikki Linnakangas wrote:
 In practice, there might be a lot of quirks and inefficiencies and
 locking contention etc. involved in various DBMS's, that you might
 be able to work around with hash partitioning. But from a
 theoretical point of view, there is no reason to expect just
 partitioning a table on a hash to make key-value lookups any faster.

Good analysis.  Has anyone benchmarked this to know our btree is
efficient in this area?

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash partitioning.

2013-06-26 Thread Yuri Levinsky
Heikki,
This is the most professional explanation I have ever seen. Let me please
disagree with the bottom line. It heavily depends on the amount of memory
and the actual index sizes. I did a benchmark ~6 years ago and I won a glass
of beer.  Anyway, I am talking about hash partitioning as a feature, and
my example comparing it with a unique b-tree index scan is a little bit
extreme. In case you have 2,4,8..1024 distinct values (not known in
advance), the index might be eliminated. That's the whole feature: no
competition for the hash function.

Sincerely yours,


Yuri Levinsky, DBA
Celltick Technologies Ltd., 32 Maskit St., Herzliya 46733, Israel
Mobile: +972 54 6107703, Office: +972 9 9710239; Fax: +972 9 9710222


-Original Message-
From: Heikki Linnakangas [mailto:hlinnakan...@vmware.com] 
Sent: Wednesday, June 26, 2013 5:10 PM
To: Yuri Levinsky
Cc: Tom Lane; Christopher Browne; Robert Haas; Bruce Momjian; PostgreSQL
Mailing Lists
Subject: Re: [HACKERS] Hash partitioning.

On 26.06.2013 16:41, Yuri Levinsky wrote:
 Heikki,
 As far as I understand the height of the btree will affect the number 
 of I/Os necessary. The height of the tree does not increase linearly 
 with the number of records.

The height of a b-tree is O(log n), where n is the number of records. 
Informally, if we assume that you have on average, say, 1000 keys on one
b-tree page, a two-level b-tree can hold one million items, and a three
level one billion items, and so on. The height of the tree affects the
number of I/Os needed for searches, but keep in mind that the top levels
of the tree are going to be very frequently accessed and in practice
will stay permanently cached. You will only perform actual I/O on the
1-2 bottom levels of the tree (or 0 if it all fits in cache)

Now let's compare that with a hash partitioned table, with 1000
partitions and a b-tree index on every partition. When doing a search,
you first hash the key to look up the right partition, then you search
the index of that partition. This is almost equivalent to just having a
b-tree that's one level taller - instead of looking up the right
partition in the hash table, you look up the right child page at the
root of the b-tree. From a very coarse theoretical point of view, the
only difference is that you replaced the binary search on the b-tree
root page with an equivalent hash lookup. A hash lookup can be somewhat
cheaper than binary search, but in practice there is very little
difference. There certainly isn't any difference in the number of actual
I/O performed.

In practice, there might be a lot of quirks and inefficiencies and
locking contention etc. involved in various DBMS's, that you might be
able to work around with hash partitioning. But from a theoretical point
of view, there is no reason to expect just partitioning a table on a
hash to make key-value lookups any faster.

- Heikki





-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Developer meeting photos

2013-06-26 Thread Alexander Korotkov
Hackers,

at the last developer meeting we missed Oleg Bartunov. So, it's not surprising
that the photos are missing as well.
I remember that somebody took photos, but unfortunately I don't remember who.
My employer, who sponsored my attendance at PGCon, wants to publish a post about
it on the website. So, it would be very nice to have a photo of the developer
meeting.
Does anybody have one?

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] Hash partitioning.

2013-06-26 Thread Tom Lane
Heikki Linnakangas hlinnakan...@vmware.com writes:
 On 26.06.2013 11:17, Yuri Levinsky wrote:
 When you dealing with company, which has
 ~350.000.000 users, and you don't want to use key/value data stores: you
 need hash partitioned tables and hash partitioned table clusters to
 perform fast search and 4-6 tables join based on user phone number for
 example.

 B-trees are surprisingly fast for key-value lookups. There is no reason 
 to believe that a hash partitioned table would be faster for that than a 
 plain table.

Or in short: the quoted advice may very well be true for Oracle, but
applying it blindly to Postgres is not a good idea.  PG's performance
characteristics are a lot different, especially in the area of
partitioned tables.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bloom Filter lookup for hash joins

2013-06-26 Thread Tom Lane
Ants Aasma a...@cybertec.at writes:
 On Wed, Jun 26, 2013 at 9:46 AM, Atri Sharma atri.j...@gmail.com wrote:
 I have been reading the current implementation of hash joins, and in
 ExecScanHashBucket, which I understand is the actual lookup function,
 we could potentially look at a bloom filter per bucket. Instead of
 actually looking up each hash value for the outer relation, we could
 just check the corresponding bloom filter for that bucket, and if we
 get a positive, then lookup the actual values i.e. continue with our
 current behaviour (since we could be looking at a false positive).

 The problem here is that if the hash table is in memory, doing a hash
 table lookup directly is likely to be cheaper than a bloom filter
 lookup,

Yeah.  Given the plan to reduce NTUP_PER_BUCKET to 1, it's hard to see
how adding a Bloom filter phase could be anything except overhead.  Even
with the current average bucket length, it doesn't sound very promising.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] A better way than tweaking NTUP_PER_BUCKET

2013-06-26 Thread Atri Sharma
On Wed, Jun 26, 2013 at 7:12 PM, Stephen Frost sfr...@snowman.net wrote:
 Atri,

 * Atri Sharma (atri.j...@gmail.com) wrote:
 I just popped in here on Simon's advice to put an idea I had about
 optimizing hash joins on this thread.

 I'd encourage reading the thread a bit first, in the future.. :)

Yeah, I actually read a bit (admittedly, not much) of the above thread. I
was following it a bit as well.

 I suggested this up-thread already, but it's not really a bloom filter
 as there's only one hash function available- I can't see us requiring
 every data type to provide multiple hash functions.  Still, I do think
 breaking the single 32-bit hash key space up into fixed-sized chunks and
 then having a bitfield array which we test against (very similar to how
 the visibility map works) to see if there's any chance that a given hash
 key exists might be valuable.  The problem is that, because we don't
 have multiple hash functions, it's not clear how much empty space we'd
 actually end up with.

Agreed.

 We could have a bitfield filter (as I described above) created for each
 bucket and then test against that before considering if we actually have
 to go look in that bucket, yes.  I'm not sure if that's quite what you
 were thinking, but I can see how a bitfield per bucket might work.  If
 you were suggesting something else, please clarify.

Yeah, this is what I wanted.

My point is that I would like to help in the implementation, if possible. :)

Regards,

Atri


--
Regards,

Atri
l'apprenant


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] LATERAL quals revisited

2013-06-26 Thread Antonin Houska

On 06/26/2013 12:52 AM, Tom Lane wrote:

Instead of setting it aside, can we (mis)use placeholder var (PHV), to
ensure that the WHERE clause is evaluated below the OJ; instead of
combining it with the ON clause?

No, that doesn't help; it has to be part of the joinquals at the join
node, or you don't get the right execution semantics.
When I wrote 'below the OJ' I meant to retain something of the original 
semantics (just like the subquery applies the WHERE clause below the OJ).

However that's probably too restrictive and your next arguments

Plus you could
lose some optimization opportunities, for example if we fail to see
that there's a strict join clause associated with the outer join
(cf lhs_strict).  Worse, I think wrapping a PHV around an otherwise
indexable clause would prevent using it for an indexscan.


also confirm the restrictiveness. So I can forget.

One more concern anyway: doesn't your proposal make subquery pull-up a 
little bit risky in terms of cost of the resulting plan?


IMO the subquery in the original query may filter out many rows and thus 
decrease the number of pairs to be evaluated by the join the ON clause 
belongs to.
If the WHERE clause moves up, then the resulting plan might be less 
efficient than the one we'd get if the subquery hadn't been pulled-up at 
all.


However, at the time of cost evaluation there's no way to get back (not
even to notice the higher cost) because the original subquery is already
gone at an earlier stage of the planning.


Regards,
Antonin Houska (Tony)


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash partitioning.

2013-06-26 Thread Markus Wanner
On 06/26/2013 04:10 PM, Yuri Levinsky wrote:
 You typically don't want to use a b-tree index when you select
 more than ~1-2% of your data.

Agreed. Indices on columns with very low selectivity don't perform well.
(Postgres knows that and uses a sequential scan based on selectivity
estimates. Being able to eliminate entire partitions from such a seq
scan would certainly be beneficial, yes.)

In the Postgres world, though, I think CLUSTERing might be the better
approach to solve that problem. (Note: this has nothing to do with
distributed systems in this case.) I'm not sure what the current status
of auto clustering or optimized scans on such a permanently clustered
table is, though.

The minmax indices proposed for 9.4 might be another feature worth
looking at.

Both of these approaches may eventually provide a more general and more
automatic way to speed up scans on large portions of a relation, IMO.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] LATERAL quals revisited

2013-06-26 Thread Tom Lane
Antonin Houska antonin.hou...@gmail.com writes:
 If the WHERE clause moves up, then the resulting plan might be less 
 efficient than the one we'd get if the subquery hadn't been pulled-up at 
 all.

No, because we can push the qual back down again (using a parameterized
path) if that's appropriate.  The problem at this stage is to understand
the semantics of the outer join correctly, not to make a choice of what
the plan will be.

In fact, the reason we'd not noticed this bug before is exactly that
all the test cases in the regression tests do end up pushing the qual
back down.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bugfix and new feature for PGXS

2013-06-26 Thread Andrew Dunstan


On 06/25/2013 11:29 AM, Cédric Villemain wrote:

On Tuesday, June 25, 2013 17:18:51, Andrew Dunstan wrote:

On 06/24/2013 07:24 PM, Cédric Villemain wrote:

On Tuesday, June 25, 2013 00:18:26, Andrew Dunstan wrote:

On 06/24/2013 04:02 PM, Cédric Villemain wrote:

With extensions, we do have to set VPATH explicitly if we want to use
VPATH (note that contribs/extensions must not care that postgresql has
been built with or without VPATH set). My patches try to fix that.

No, this is exactly what I'm objecting to. I want to be able to do:
  invoke_vpath_magic
  standard_make_commands_as_for_non_vpath

Your patches have been designed to overcome your particular issues, but
they don't meet the general case, IMNSHO. This is why I want to have
more discussion about how vpath builds could work for PGXS modules.

The patch does not restrict anything, it is not supposed to lead to
regression.
The assignment of VPATH and srcdir are wrong in the PGXS case, the patch
correct them. You're still free to do make VPATH=/mypath ... the VPATH
provided won't be erased by the pgxs.mk logic.

I still think this is premature.  I have spent some more time trying to
make it work the way I think it should, so far without success. I think
we need more discussion about how we'd like VPATH to work for PGXS
would. There's no urgency about this - we've lived with the current
situation for quite a while.

Sure...
I did a quick and dirty patch (I just validated it, without a lot of testing),
attached to this email to fix json_build (at least for make, make install)
As you're constructing targets based on commands to execute in the srcdir
directory, and because srcdir is only set in pgxs.mk, it is possible to define
srcdir early in the json_build/Makefile and use it.



This still doesn't do what I really want, which is to be able to invoke 
make without anything special in the invocation that triggers VPATH 
processing.


Here's what I did that works like I think it should.

First the attached patch on top of yours.

Second, I did:

mkdir vpath.json_build
cd vpath.json_build
sh /path/to/pgsource/config/prep_buildtree ../json_build .
ln -s ../json_build/json_build.control .

Then I created vpath.mk with these contents:

ext_srcdir = ../json_build
USE_VPATH = $(ext_srcdir)

Finally I vpath-ized the Makefile (see attached).

Given all of that, I was able to do, in the vpath directory:

make
make install
make installcheck
make clean

and it all just worked, with exactly the same make invocations as work 
in an in-tree build.


So what's lacking to make this nice is a tool to automate the second and 
third steps above.


Maybe there are other ways of doing this, but this does what I'd like to 
see.


cheers

andrew

diff --git a/src/makefiles/pgxs.mk b/src/makefiles/pgxs.mk
index f70d5f7..42e3654 100644
--- a/src/makefiles/pgxs.mk
+++ b/src/makefiles/pgxs.mk
@@ -61,18 +61,21 @@ ifdef PGXS
 top_builddir := $(dir $(PGXS))../..
 include $(top_builddir)/src/Makefile.global
 
-# If Makefile is not in current directory we are building the extension with
-# VPATH so we set the variable here
-# XXX what about top_srcdir ?
-ifeq ($(CURDIR),$(dir $(firstword $(MAKEFILE_LIST))))
 top_srcdir = $(top_builddir)
+# If USE_VPATH is set or Makefile is not in current directory we are building 
+# the extension with VPATH so we set the variable here
+ifdef USE_VPATH
+srcdir = $(USE_VPATH)
+VPATH = $(USE_VPATH)
+else
+ifeq ($(CURDIR),$(dir $(firstword $(MAKEFILE_LIST))))
 srcdir = .
 VPATH =
 else
-top_srcdir = $(top_builddir)
 srcdir = $(dir $(firstword $(MAKEFILE_LIST)))
 VPATH = $(srcdir)
 endif
+endif
 
 # These might be set in Makefile.global, but if they were not found
 # during the build of PostgreSQL, supply default values so that users
EXTENSION= json_build

ifeq ($(wildcard vpath.mk),vpath.mk)
include vpath.mk
else
ext_srcdir = .
endif

EXTVERSION   = $(shell grep default_version $(EXTENSION).control | sed -e "s/default_version[[:space:]]*=[[:space:]]*'\([^']*\)'/\1/")
DATA = $(filter-out $(wildcard sql/*--*.sql),$(wildcard sql/*.sql))
DOCS = $(wildcard $(ext_srcdir)/doc/*.md)
USE_MODULE_DB = 1
TESTS= $(wildcard $(ext_srcdir)/test/sql/*.sql)
REGRESS_OPTS = --inputdir=$(ext_srcdir)/test --outputdir=test \
--load-extension=$(EXTENSION)
REGRESS  = $(patsubst $(ext_srcdir)/test/sql/%.sql,%,$(TESTS))
MODULE_big  = $(EXTENSION)
OBJS = $(patsubst $(ext_srcdir)/src/%.c,src/%.o,$(wildcard $(ext_srcdir)/src/*.c))
PG_CONFIG= pg_config

all: sql/$(EXTENSION)--$(EXTVERSION).sql

sql/$(EXTENSION)--$(EXTVERSION).sql: $(ext_srcdir)/sql/$(EXTENSION).sql
	cp $< $@

DATA_built = sql/$(EXTENSION)--$(EXTVERSION).sql
DATA = $(filter-out $(ext_srcdir)/sql/$(EXTENSION)--$(EXTVERSION).sql, $(wildcard $(ext_srcdir)/sql/*--*.sql))
EXTRA_CLEAN = sql/$(EXTENSION)--$(EXTVERSION).sql

PGXS := $(shell $(PG_CONFIG) --pgxs)
include $(PGXS)

# we put all the tests in a test subdir, but 

Re: [HACKERS] Computer VARSIZE_ANY(PTR) during debugging

2013-06-26 Thread Alvaro Herrera
Amit Langote wrote:

 The segfault in question happens at line 1141:
 
 off = att_align_pointer(off, thisatt->attalign, -1, tp + off);
 
 char   *tp; /* ptr to tuple data */
 long    off;            /* offset in tuple data */
 
 Disassembling seems to suggest (tp + off) is the faulting address.
 Apparently, the segfault happens when 5th text column is being
 extracted from a tuple (char(n), char(n), int4, char(n), text, ...).
 Since, tp is fixed for the whole duration of loop and only off is
 subject to change over iterations, it may have happened due to wrong
 offset in this iteration.
 
 Has anything of this kind been encountered/reported before?

Yes, I vaguely recall I have seen this in cases where tuples contain
corrupt data.  I think you just need the length word of the fourth datum
to be wrong.
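A trivial way to picture it, with made-up attribute lengths: once one length word is garbage, every later offset computed by the deforming loop is garbage too, and tp + off can point anywhere.

#include <stdio.h>

int
main(void)
{
	/* pretend byte lengths of the first four attributes; the 4th is corrupt */
	long	attlen[4] = {8, 8, 4, 1000000000L};
	long	off = 0;

	for (int i = 0; i < 4; i++)
		off += attlen[i];

	printf("offset used for the 5th column: %ld\n", off);
	return 0;
}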

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash partitioning.

2013-06-26 Thread Markus Wanner
On 06/26/2013 04:01 PM, k...@rice.edu wrote:
 I think he is referring to the fact that with parallel query execution,
 multiple partitions can be processed simultaneously instead of serially
 as they are now with the resulting speed increase.

Processing simultaneously is the purpose of parallel query execution,
yes. But I see no reason for that not to work equally well for
unpartitioned tables.

Disk I/O is already pretty well optimized and parallelized, I think.
Trying to parallelize a seq scan on the Postgres side is likely to yield
far inferior performance.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash partitioning.

2013-06-26 Thread Bruce Momjian
On Wed, Jun 26, 2013 at 05:04:11PM +0200, Markus Wanner wrote:
 On 06/26/2013 04:01 PM, k...@rice.edu wrote:
  I think he is referring to the fact that with parallel query execution,
  multiple partitions can be processed simultaneously instead of serially
  as they are now with the resulting speed increase.
 
 Processing simultaneously is the purpose of parallel query execution,
 yes. But I see no reason for that not to work equally well for
 unpartitioned tables.

Well, I think by definition you are going to be able to spread lookups
for a particular range across more partitions with 'hash' than with
range or list partitioning.
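A toy illustration with made-up numbers, where key % 4 stands in for a real hash function and each range partition holds 1000 keys:

#include <stdio.h>

int
main(void)
{
	int		nparts = 4;

	for (int key = 100; key < 104; key++)
		printf("key %d -> hash partition %d, range partition %d\n",
			   key, key % nparts, key / 1000);
	return 0;
}

Four adjacent keys land in four different hash partitions, but all in the same range partition.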

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Kudos for Reviewers -- straw poll

2013-06-26 Thread Markus Wanner
On 06/25/2013 08:26 PM, Andres Freund wrote:
 It's not about the reviewers being less. It's a comparison of
 effort. The effort for a casual review simply isn't comparable with the
 effort spent on developing a nontrivial patch.

Remember: Debugging is twice as hard as writing the code in the first
place. ... (Brian Kernighan)

IMO, the kind of reviews we need are of almost debug level difficulty.
(To the point where the reviewer becomes a co-author or even takes over
and submits a completely revamped patch instead.)

I agree that the casual review is several levels below that, so your
point holds. I doubt we need more reviews of that kind, though.

Thus, I'm in the AAB camp. The remaining question being: What's the
criterion for becoming a co-author (and thus getting mentioned in the
release notes)?

If at all, we should honor quality work with a prize. Maybe a prize
for the best reviewer per CF? Maybe even based on votes from the general
public on who's been the best, so reviews gain attention that way...
Click here to vote for my review. ... Maybe a crazy idea.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] BUG #7493: Postmaster messages unreadable in a Windows console

2013-06-26 Thread Noah Misch
On Mon, Jun 24, 2013 at 04:00:00PM +0400, Alexander Law wrote:
 Thanks for your work, your patch is definitely better. I agree that this  
 approach much more generic.

Committed.

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash partitioning.

2013-06-26 Thread Kevin Grittner
Markus Wanner mar...@bluegap.ch wrote:
 On 06/25/2013 11:52 PM, Kevin Grittner wrote:
 At least until we have parallel
 query execution.  At *that* point this all changes.

 Can you elaborate on that, please? I currently have a hard time
 imagining how partitions can help performance in that case,
 either.

Well, partitioning will *still* be a net loss for overall
throughput on a machine with enough active connections to keep all
the resources busy.  Where it will help is when you have a machine
with a lot of cores and a few big reporting style queries.  Since
we currently can only use one core for a single query, we leave a
lot of CPU time (often the bottleneck for such queries) unused.  If
we allow a large query to search multiple partitions in parallel, a
big query can complete sooner.  It will consume more resources
overall in doing so, but if those resources would otherwise sit
idle, it could be a big win for some use cases.

 Put another way: ideally, the system should take care of
 optimally distributing data across its physical storage itself.

Agreed.  That's not where I see the win.  Most cases where I've
seen people attempt to micro-manage object placement have performed
worse than simply giving the OS a big RAID (we had 40 spindles in
some of ours at Wis. Courts) and letting the filesystem figure it
out.  There are exceptions for special cases like WAL.  I found out
by accident how much that can matter.

 If you need to do partitioning manually for performance reasons,
 that's actually a deficiency of it, not a feature.

+1

 I certainly agree that manageability may be a perfectly valid
 reason to partition your data.

It can definitely help there.

 Maybe there even exist other good reasons.

I'm arguing that better concurrency can be one in the future.

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash partitioning.

2013-06-26 Thread Heikki Linnakangas

On 26.06.2013 18:34, Kevin Grittner wrote:

Markus Wannermar...@bluegap.ch  wrote:

On 06/25/2013 11:52 PM, Kevin Grittner wrote:

At least until we have parallel
query execution.  At *that* point this all changes.


Can you elaborate on that, please? I currently have a hard time
imagining how partitions can help performance in that case,
either.


Well, partitioning will *still* be a net loss for overall
throughput on a machine with enough active connections to keep all
the resources busy.  Where it will help is when you have a machine
with a lot of cores and a few big reporting style queries.  Since
we currently can only use one core for a single query, we leave a
lot of CPU time (often the bottleneck for such queries) unused.  If
we allow a large query to search multiple partitions in parallel, a
big query can complete sooner.


We could also allow a large query to search a single table in parallel. 
A seqscan would be easy to divide into N equally-sized parts that can be 
scanned in parallel. It's more difficult for index scans, but even then 
it might be possible at least in some limited cases.
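A static split of the block range would do for a first cut. This is only a sketch under that assumption, not anything that exists today:

typedef struct BlockRange
{
	unsigned int	start_block;
	unsigned int	end_block;		/* exclusive */
} BlockRange;

static BlockRange
worker_block_range(unsigned int nblocks, int nworkers, int worker)
{
	BlockRange		r;
	unsigned int	per_worker = (nblocks + nworkers - 1) / nworkers;

	r.start_block = per_worker * (unsigned int) worker;
	r.end_block = r.start_block + per_worker;
	if (r.start_block > nblocks)
		r.start_block = nblocks;
	if (r.end_block > nblocks)
		r.end_block = nblocks;
	return r;
}

Each worker would then run an ordinary heap scan over its own block range.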


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] A better way than tweaking NTUP_PER_BUCKET

2013-06-26 Thread Stephen Frost
* Atri Sharma (atri.j...@gmail.com) wrote:
 My point is that I would like to help in the implementation, if possible. :)

Feel free to go ahead and implement it..  I'm not sure when I'll have a
chance to (probably not in the next week or two anyway).  Unfortunately,
the bigger issue here is really about testing the results and
determining if it's actually faster/better with various data sets
(including ones which have duplicates).  I've got one test data set
which has some interesting characteristics (for one thing, hashing the
large side and then seq-scanning the small side is actually faster
than going the other way, which is quite 'odd' imv for a hashing
system): http://snowman.net/~sfrost/test_case2.sql

You might also look at the other emails that I sent regarding this
subject and NTUP_PER_BUCKET.  Having someone confirm what I saw wrt
changing that parameter would be nice and it would be a good comparison
point against any kind of pre-filtering that we're doing.

One thing that re-reading the bloom filter description reminded me of is
that it's at least conceivable that we could take the existing hash
functions for each data type and do double-hashing or perhaps seed the
value to be hashed with additional data to produce an independent hash
result to use.  Again, a lot of things that need to be tested and
measured to see if they improve overall performance.
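As a sketch of what I mean by deriving extra hashes from the one we have, in the spirit of the classic double-hashing construction g_i(x) = h1(x) + i * h2(x); the remix step below is just an illustrative bit-mixer, not an existing PostgreSQL API, and whether the derived values behave independently enough is exactly what would need testing:

static inline unsigned int
remix32(unsigned int h)
{
	/* arbitrary avalanche step, assumed good enough for illustration */
	h ^= h >> 16;
	h *= 0x85ebca6bU;
	h ^= h >> 13;
	return h;
}

/* i-th derived hash from a single 32-bit hash value */
static inline unsigned int
derived_hash(unsigned int h1, int i)
{
	unsigned int	h2 = remix32(h1) | 1;	/* odd stride so the values differ */

	return h1 + (unsigned int) i * h2;
}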

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] XLogInsert scaling, revisited

2013-06-26 Thread Heikki Linnakangas

On 24.06.2013 21:01, Andres Freund wrote:

Ok, I started to look at this:


Thanks!


* Could you document the way slots prevent checkpoints from occurring
   when XLogInsert rechecks for full page writes? I think it's correct -
   but not very obvious on a glance.


There's this in the comment near the top of the file:

 * To update RedoRecPtr or fullPageWrites, one has to make sure that all
 * subsequent inserters see the new value. This is done by reserving 
all the
 * insertion slots before changing the value. XLogInsert reads 
RedoRecPtr and

 * fullPageWrites after grabbing a slot, so by holding all the slots
 * simultaneously, you can ensure that all subsequent inserts see the new
 * value.  Those fields change very seldom, so we prefer to be fast and
 * non-contended when they need to be read, and slow when they're changed.

Does that explain it well enough? XLogInsert holds onto a slot while it 
rechecks for full page writes.



* The read of Insert->RedoRecPtr while rechecking whether it's out of
   date now is unlocked, is that correct? And why?


Same here, XLogInsert holds the slot while rechecking Insert->RedoRecPtr.


* Have you measured whether it works to just keep as many slots as
   max_backends requires around and not bothering with dynamically
   allocating them to inserters?
   That seems to require to keep actually waiting slots in a separate
   list which very well might make that too expensive.

   Correctness wise the biggest problem for that probably is exlusive
   acquiration of all slots CreateCheckpoint() does...


Hmm. It wouldn't be much different, each backend would still need to 
reserve its own dedicated slot, because it might be held by a 
backend that grabbed all the slots. Also, bgwriter and checkpointer need 
to flush the WAL, so they'd need slots too.


More slots make WaitXLogInsertionsToFinish() more expensive, as it has 
to loop through all of them. IIRC some earlier pgbench tests I ran 
didn't show much difference in performance, whether there were 40 slots 
or 100, as long as there was enough of them. I can run some more tests 
with even more slots to see how it behaves.



* What about using some sort of linked list of unused slots for
   WALInsertSlotAcquireOne? Everytime we're done we put it to the *end*
   of the list so it's less likely to have been grabbed by somebody else
   so we can reuse it.
   a) To grab a new one go to the head of the linked list spinlock it and
   recheck whether it's still free. If not, restart. Otherwise, remove
   from list.
   b) To reuse a previously used slot

   That way we only have to do the PGSemaphoreLock() dance if there
   really aren't any free slots.


That adds a spinlock acquisition / release into the critical path, to 
protect the linked list. I'm trying very hard to avoid serialization 
points like that.


While profiling the tests I've done, finding a free slot hasn't shown up 
much. So I don't think it's a problem performance-wise as it is, and I 
don't think it would make the code simpler.



* The queuing logic around slots seems to lack documentation. It's
   complex enough to warrant that imo.


Yeah, it's complex. I expanded the comments above XLogInsertSlot, to 
explain how xlogInsertingAt works. Did that help, or was it some other 
part of that that needs more docs?



* Not a big fan of the holdingAll variable name, for a file-global one
   that seems a bit too generic.


Renamed to holdingAllSlots.


* Could you add a #define or comment for the 64 used in
   XLogInsertSlotPadded? Not everyone might recognize that immediately as
   the most common cacheline size.
   Commenting on the reason we pad in general would be a good idea as
   well.


Copy-pasted and edited the explanation from LWLockPadded for that. I 
also changed the way it's allocated so that it's aligned at 64-byte 
boundary, like we do for lwlocks.
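The shape of it is the same union trick LWLockPadded uses, roughly like this; the slot contents below are a stand-in, not the real fields:

#define XLOG_SLOT_PAD_SIZE	64		/* assumed cache line size */

typedef struct XLogInsertSlotData
{
	unsigned long	xlogInsertingAt;	/* placeholder for the real contents */
} XLogInsertSlotData;

typedef union XLogInsertSlotPadded
{
	XLogInsertSlotData	slot;
	char				pad[XLOG_SLOT_PAD_SIZE];
} XLogInsertSlotPadded;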



* Is it correct that WALInsertSlotAcquireOne() resets xlogInsertingAt of
   all slots *before* it has acquired locks in all of them? If yes, why
   haven't we already reset it to something invalid?


I didn't understand this part. Can you elaborate?


* Is GetXLogBuffer()'s unlocked read correct without a read barrier?


Hmm. I thought that it was safe because GetXLogBuffer() handles the case 
that you get a torn read of the 64-bit XLogRecPtr value. But now that 
I think of it, I wonder if it's possible for reads/writes to be 
reordered so that AdvanceXLInsertBuffer() overwrites WAL data that 
another backend has copied onto a page. Something like this:


1. Backend B zeroes a new WAL page, and sets its address in xlblocks
2. Backend A calls GetXLogBuffer(), and gets a pointer to that page
3. Backend A copies the WAL data to the page.

There is no lock acquisition in backend A during those steps, so I think 
in theory the writes from step 3 might be reordered to happen before 
step 1, so that that step 1 overwrites the WAL data written in step 3. 
It sounds crazy, but after reading 

Re: [HACKERS] Hash partitioning.

2013-06-26 Thread Markus Wanner
On 06/26/2013 05:46 PM, Heikki Linnakangas wrote:
 We could also allow a large query to search a single table in parallel.
 A seqscan would be easy to divide into N equally-sized parts that can be
 scanned in parallel. It's more difficult for index scans, but even then
 it might be possible at least in some limited cases.

So far reading sequentially is still faster than hopping between
different locations. Purely from the I/O perspective, that is.

For queries where the single CPU core turns into a bottle-neck and which
we want to parallelize, we should ideally still do a normal, fully
sequential scan and only fan out after the scan and distribute the
incoming pages (or even tuples) to the multiple cores to process.
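In other words, something shaped like this toy loop, with page numbers standing in for real buffers: one process reads in physical order and only the per-page CPU work fans out.

#include <stdio.h>

#define NPAGES		12
#define NWORKERS	3

int
main(void)
{
	for (int page = 0; page < NPAGES; page++)
	{
		int		worker = page % NWORKERS;	/* round-robin hand-off */

		printf("page %d -> worker %d\n", page, worker);
	}
	return 0;
}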

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] A better way than tweaking NTUP_PER_BUCKET

2013-06-26 Thread Atri Sharma
On Wed, Jun 26, 2013 at 9:20 PM, Stephen Frost sfr...@snowman.net wrote:
 * Atri Sharma (atri.j...@gmail.com) wrote:
 My point is that I would like to help in the implementation, if possible. :)

 Feel free to go ahead and implement it..  I'm not sure when I'll have a
 chance to (probably not in the next week or two anyway).  Unfortunately,
 the bigger issue here is really about testing the results and
 determining if it's actually faster/better with various data sets
 (including ones which have duplicates).  I've got one test data set
 which has some interesting characteristics (for one thing, hashing the
 large side and then seq-scanning the small side is actually faster
 than going the other way, which is quite 'odd' imv for a hashing
 system): http://snowman.net/~sfrost/test_case2.sql

 You might also look at the other emails that I sent regarding this
 subject and NTUP_PER_BUCKET.  Having someone confirm what I saw wrt
 changing that parameter would be nice and it would be a good comparison
 point against any kind of pre-filtering that we're doing.

 One thing that re-reading the bloom filter description reminded me of is
 that it's at least conceivable that we could take the existing hash
 functions for each data type and do double-hashing or perhaps seed the
 value to be hashed with additional data to produce an independent hash
 result to use.  Again, a lot of things that need to be tested and
 measured to see if they improve overall performance.

Right, let me look. Although I am pretty busy atm with ordered set
functions, so will get it done maybe last week of this month.

Another thing I believe in is that we should have multiple hashing
functions for bloom filters, which generate different probability
values so that the coverage is good.

Regards,

Atri

--
Regards,

Atri
l'apprenant


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PQConnectPoll, connect(2), EWOULDBLOCK and somaxconn

2013-06-26 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 On 2013-06-17 16:16:22 +0200, Andres Freund wrote:
 Not accepting EWOULDBLOCK in the above if() results in:
 could not connect to server: Resource temporarily unavailable
 Is the server running locally and accepting
 connections on Unix domain socket /tmp/.s.PGSQL.5440?
 which makes more sense.

 Could I convince a committer to NACK or commit  backpatch that patch?

Some trawling in the commit history shows that the current logic dates
from my commit 6d0d838cebdf2bcd5c03f5449a1888f1e120496f, which unified
Windows and non-Windows code paths; the check for EWOULDBLOCK was added
in the earlier commit ca5a51627919c6fb6ab5e23739615a674caa4037 which
(claimed to) add support for non-blocking connect on Windows.  So I'm
concerned that your patch could break that platform.  A possible
compromise is

{
if (SOCK_ERRNO == EINPROGRESS ||
+#ifndef WIN32
SOCK_ERRNO == EWOULDBLOCK ||
+#endif
SOCK_ERRNO == EINTR ||
SOCK_ERRNO == 0)

but I wonder whether it's safe to remove the case altogether.  Could
anyone research the situation for non-blocking connect() on Windows?
Perhaps on Windows we shouldn't test for EINPROGRESS at all?

I'm also pretty suspicious of the SOCK_ERRNO == 0 case, now that I look
at it, especially in view of the lack of any attempt to set errno to 0
before the call.  The Single Unix Spec is pretty clear that EINTR and
EINPROGRESS are the only legit cases, so unless Windows is doing its own
thing, we could probably get rid of that too.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Computer VARSIZE_ANY(PTR) during debugging

2013-06-26 Thread Amit Langote
On Thu, Jun 27, 2013 at 12:02 AM, Alvaro Herrera
alvhe...@2ndquadrant.com wrote:
 Amit Langote wrote:

 The segfault in question happens at line 1141:

 off = att_align_pointer(off, thisatt->attalign, -1, tp + off);

 char   *tp; /* ptr to tuple data */
 long    off;            /* offset in tuple data */

 Disassembling seems to suggest (tp + off) is the faulting address.
 Apparently, the segfault happens when 5th text column is being
 extracted from a tuple (char(n), char(n), int4, char(n), text, ...).
 Since, tp is fixed for the whole duration of loop and only off is
 subject to change over iterations, it may have happened due to wrong
 offset in this iteration.

 Has anything of this kind been encountered/reported before?

 Yes, I vaguely recall I have seen this in cases where tuples contain
 corrupt data.  I think you just need the length word of the fourth datum
 to be wrong.


The query in question is:

select col1, col2, col4, octet_length(col5) from table where
octet_length(col5) > 800;

In case of corrupt data, even select * from table should give
segfault, shouldn't it?

--
Amit Langote


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Kudos for Reviewers -- straw poll

2013-06-26 Thread Dimitri Fontaine
Josh Berkus j...@agliodbs.com writes:
 Well, one of the other prizes which occurred to me today would be a
 pgCon lottery.  That is, each review posted by a non-committer would go
 in a hat, and in February we would draw one who would get a free
 registration and airfare to pgCon.

+1, I like that idea!

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Add more regression tests for dbcommands

2013-06-26 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 I am generally a bit unsure whether the regression tests you propose
 aren't a bit too verbose. Does any of the committers have an opinion
 about this?

 My feeling is that they are ok if they aren't slowing down things much.

Yeah, I'm concerned about speed too.  If the regression tests take a
long time it makes it harder for developers to run them.  (I like to
point at mysql's regression tests, which take well over an hour even on
fast machines.  If lots of tests are so helpful, why is their bug rate
no better than ours?)

I was intending to suggest that much of what Robins has submitted
doesn't belong in the core regression tests, but could usefully be put
into an optional set of big regression tests.  We already have a
numeric_big test in that spirit.  What seems to be lacking is an
organizational principle for this (should the big tests live with the
regular ones, or in a separate directory?) and a convenient make
target for running the big ones.  Once we had a setup like that, we
could get some or all of the buildfarm machines to run the big tests,
but we wouldn't be penalizing development work.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fixing pg_ctl with relative paths

2013-06-26 Thread Fujii Masao
On Wed, Jun 26, 2013 at 2:36 PM, Hari Babu haribabu.ko...@huawei.com wrote:
 On June 26, 2013 5:02 AM Josh Kupershmidt wrote:
Thanks for the feedback. Attached is a rebased version of the patch with
 the two small issues noted fixed.

Doesn't the following description in the pg_ctl documentation need to be modified?

restart might fail if relative paths specified were specified on
the command-line during server start.

+#define DATADIR_SPEC   \-D\ \
+
+   datadir = strstr(post_opts, DATADIR_SPEC);

Though this is a corner case, the patch doesn't seem to handle properly the case
where -D appears as another option's value, e.g., as the -k option value, in the
postmaster.opts file.

One idea to work around that problem is to just append the specified -D option
and value to post_opts. IOW, the -D option and value would appear twice in
post_opts, and the later occurrence wins in the end. Thoughts?
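Roughly like this; the function name is made up, and pg_ctl would of course have to quote things more carefully than this sketch does:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static char *
append_datadir(const char *post_opts, const char *datadir)
{
	size_t	len = strlen(post_opts) + strlen(datadir) + 16;
	char   *result = malloc(len);

	/* the later -D wins when the postmaster parses its options */
	if (result != NULL)
		snprintf(result, len, "%s -D \"%s\"", post_opts, datadir);
	return result;
}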

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Review: Patch to compute Max LSN of Data Pages

2013-06-26 Thread Fujii Masao
On Wed, Jun 26, 2013 at 8:57 PM, Amit Kapila amit.kap...@huawei.com wrote:
 On Wednesday, June 26, 2013 4:40 PM Andres Freund wrote:
 Hi Amit,

 On 2013-06-26 16:22:28 +0530, Amit Kapila wrote:
  On Wednesday, June 26, 2013 1:20 PM Andres Freund wrote:
   On 2013-06-26 08:50:27 +0530, Amit Kapila wrote:
On Tuesday, June 25, 2013 11:12 PM Andres Freund wrote:
 On 2013-06-16 17:19:49 -0700, Josh Berkus wrote:
  Amit posted a new version of this patch on January 23rd.  But
   last
  comment on it by Tom is not sure everyone wants this.
 
  https://commitfest.postgresql.org/action/patch_view?id=905

  ... so, what's the status of this patch?

 That comment was referencing a mail of mine - so perhaps I
 better
 explain:

 I think the usecase for this utility isn't big enough to be
   included in
 postgres since it really can only help in a very limited
 circumstances. And I think it's too likely to be misused for
 stuff
   it's
 not useable for (e.g. remastering).

 The only scenario I see is that somebody deleted/corrupted
 pg_controldata. In that scenario the tool is supposed to be
 used to
 find
 the biggest lsn used so far so the user then can use
 pg_resetxlog
   to
 set
 that as the wal starting point.
 But that can be way much easier solved by just setting the LSN
 to
 something very, very high. The database cannot be used for
 anything
 reliable afterwards anyway.
   
One of the main reason this was written was to make server up in
 case
   of
corruption and
user can take dump of some useful information if any.
   
By setting LSN very, very high user might loose the information
 which
   he
wants to take dump.
  
   Which information would that loose?
Information from WAL replay which can be more appropriate by
 selecting
  LSN.

 Sorry, I can't follow. If wal replay still is an option you can just
 look at the WAL and get a sensible value way easier.

 Originally 2 parts were proposed, one was to get LSN from data pages and
 other from data pages.
 Original proposal is:
 http://www.postgresql.org/message-id/6C0B27F7206C9E4CA54AE035729E9C382851FFA
 1@szxeml509-mbs

 The second part for looking into WAL was written but due to xlogreader
 patch, it was postponed and I didn't get time after that
 to pursue it.


The whole tool
 seems to only make sense if you've lost pg_xlog.

 The tool's initial intent was if pg_controldata is lost and this idea is
 originated in below mail chain:
 http://www.postgresql.org/message-id/4274.1340084...@sss.pgh.pa.us


Also for a developer, guessing very high LSN might be easy, but for
 users
it might not be equally easy, and getting such value by utility
 would be
  comfortable.

 Well, then we can just document some very high lsn and be done with
 it. Like CF00/.
 That would leave enough space for eventual writes caused while dumping
 the database (say hint bit writes in a checksummed database) and cannot
 yet be realistically be reached during normal operation.

 Can we be ultra sure that this LSN is not reached? I think it will take
 very long to reach such an LSN, but still theoretically it can be possible.
 I don't have any evidence.

One more use case for which this utility was done is as below:
It will be used to decide that on new-standby (old-master) whether
 a full
  backup is needed from
New-master(old-standby).
The backup is required when the data page in old-master precedes
the last applied LSN in old-standby (i.e., new-master) at the
 moment
of the failover.

 That's exactly what I was afraid of. Unless I miss something the tool
 is
 *NOT* sufficient to do this.

 You mean to say if user knows the max LSN of data pages in old-master and
 last applied LSN in new master, he cannot decide whether
 Full backup is needed? It should be straightforward decision that skip a
 backup if that old-master LSN is less than the new-master (i.e., last
 applied LSN, IOW, timeline switch LSN).
 It was proposed as a usecase in this below mail:
 http://www.postgresql.org/message-id/CAHGQGwHyd1fY0hF0qKh0-uKDh-gcbYxMOFBYVk
 kh4jzji-f...@mail.gmail.com

I guess he meant the commit hint bit problem.

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] A better way than tweaking NTUP_PER_BUCKET

2013-06-26 Thread Stephen Frost
* Atri Sharma (atri.j...@gmail.com) wrote:
 Right, let me look. Although I am pretty busy at the moment with ordered
 set functions, so I will get it done maybe in the last week of this month.

Isn't it currently the last week of this month? :)  I'm guessing you
mean July.

 Another thing I believe is that we should have multiple hash functions
 for bloom filters, producing sufficiently different values so that the
 coverage is good.

I really don't see that happening, to be honest..  I think it would be
interesting to try some of the surrogate-additional-hashing that I
described.
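
For illustration only (nothing below is from a patch in this thread): a
common way to get several Bloom-filter hash functions cheaply is to derive
them from two base hashes as g_i(x) = h1(x) + i * h2(x) (the
Kirsch-Mitzenmacher construction). A minimal, self-contained C sketch; the
filter size, hash count, and base hash (FNV-1a with two different offsets)
are arbitrary choices for the example:

#include <stddef.h>
#include <stdint.h>

#define BLOOM_BITS   (1 << 20)      /* filter size in bits, illustrative */
#define BLOOM_HASHES 4              /* k, illustrative */

static uint8_t bloom[BLOOM_BITS / 8];

/* FNV-1a with a caller-supplied offset basis; any two reasonably
 * independent base hashes would do here. */
static uint32_t
fnv1a(const void *key, size_t len, uint32_t seed)
{
    const uint8_t *p = (const uint8_t *) key;
    uint32_t    h = seed;

    for (size_t i = 0; i < len; i++)
    {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

/* g_i(x) = h1(x) + i * h2(x), reduced to the filter size */
static uint32_t
bloom_pos(const void *key, size_t len, int i)
{
    uint32_t    h1 = fnv1a(key, len, 2166136261u);
    uint32_t    h2 = fnv1a(key, len, 0x9e3779b9u);

    return (h1 + (uint32_t) i * h2) % BLOOM_BITS;
}

static void
bloom_add(const void *key, size_t len)
{
    for (int i = 0; i < BLOOM_HASHES; i++)
    {
        uint32_t    pos = bloom_pos(key, len, i);

        bloom[pos / 8] |= (uint8_t) (1 << (pos % 8));
    }
}

/* 0 = definitely absent, 1 = possibly present */
static int
bloom_maybe_present(const void *key, size_t len)
{
    for (int i = 0; i < BLOOM_HASHES; i++)
    {
        uint32_t    pos = bloom_pos(key, len, i);

        if ((bloom[pos / 8] & (1 << (pos % 8))) == 0)
            return 0;
    }
    return 1;
}

The point is only that the k probe positions come from two hash
computations rather than k independent ones, which is usually good enough
in practice.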

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] A better way than tweaking NTUP_PER_BUCKET

2013-06-26 Thread Atri Sharma

 Isn't it currently the last week of this month? :)  I'm guessing you
 mean July.
 Heh, no. I lose track of time these days. Alright, second week of July then.


 I really don't see that happening, to be honest..  I think it would be
 interesting to try some of the surrogate-additional-hashing that I
 described.

Yes, I agree to that.

Let me think more and get back on this.

Regards,

Atri



--
Regards,

Atri
l'apprenant


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal 9.4 plpgsql: allows access to call stack from GET DIAGNOSTICS statement

2013-06-26 Thread Pavel Stehule
Hello

updated patch with some basic doc

Regards

Pavel


2013/6/26 Rushabh Lathia rushabh.lat...@gmail.com:



 On Wed, Jun 26, 2013 at 5:11 PM, Pavel Stehule pavel.steh...@gmail.com
 wrote:

 2013/6/26 Rushabh Lathia rushabh.lat...@gmail.com:
 
 
 
  On Tue, Jun 25, 2013 at 11:09 AM, Pavel Stehule
  pavel.steh...@gmail.com
  wrote:
 
  2013/6/25 Rushabh Lathia rushabh.lat...@gmail.com:
  
  
  
   On Tue, Jun 25, 2013 at 1:47 AM, Pavel Stehule
   pavel.steh...@gmail.com
   wrote:
  
   Hello
  
    This is a fragment of my old code from the orafce package - it is
    functional, but it is written a little more generically because direct
    access to the static variables (in elog.c) is impossible.
  
   static char*
   dbms_utility_format_call_stack(char mode)
   {
  MemoryContext oldcontext = CurrentMemoryContext;
  ErrorData *edata;
  ErrorContextCallback *econtext;
  StringInfo sinfo;
  
  #if PG_VERSION_NUM >= 80400
 errstart(ERROR, __FILE__, __LINE__, PG_FUNCNAME_MACRO,
   TEXTDOMAIN);
  #else
 errstart(ERROR, __FILE__, __LINE__, PG_FUNCNAME_MACRO);
   #endif
  
  MemoryContextSwitchTo(oldcontext);
  
  for (econtext = error_context_stack;
econtext != NULL;
econtext = econtext->previous)
   (*econtext->callback) (econtext->arg);
  
  edata = CopyErrorData();
  FlushErrorState();
  
   https://github.com/orafce/orafce/blob/master/utility.c
  
   2013/6/24 Rushabh Lathia rushabh.lat...@gmail.com:
Hi,
   
The purpose of this feature is to get the call stack for debugging
without raising an exception. This definitely seems very useful.
   
Here are my comments about the submitted patch:
   
The patch applies cleanly (there are a couple of whitespace warnings,
but those are in the test case file). make and make install went
smoothly. make check was smooth too. I did some manual testing of the
feature and it went smoothly.
   
However, I would like to share a couple of points:
   
  
    My main motive is to concentrate on the stack info string only, so I
    didn't use the errstart function, and I didn't use CopyErrorData
    either. These routines do a lot of other work.
  
  
1) I noticed that you did the initialization of edata manually; just
wondering, can't we directly use CopyErrorData(), which will do that
job?
   
  
    yes, we can, but in this case only the context field is interesting
    for us.
  
I found that inside CopyErrorData() there is Assert:
   
Assert(CurrentMemoryContext != ErrorContext);
   
And in the patch we are using ErrorContext as a short-lived context;
is that the reason for not using CopyErrorData()?
  
    it was not the primary reason, but sure - this routine is not designed
    for direct use from the elog module. Also, we can save one palloc
    call.
  
  
   Hmm ok.
  
  
  
   
   
2) To reset stack to empty, FlushErrorState() can be used.
   
  
    yes, it can be. I used explicit code due to different semantics,
    although it does the same thing. FlushErrorState resets some global
    error state and is used from other modules. I did a reset of
    ErrorContext. I feel there is a small difference there - so
    FlushErrorState is not the best name for my purpose.
    What about this modification:
  
   static void
   resetErrorStack()
   {
   errordata_stack_depth = -1;
   recursion_depth = 0;
   /* Delete all data in ErrorContext */
   MemoryContextResetAndDeleteChildren(ErrorContext);
   }
  
   and then call in  InvokeErrorCallbacks -- resetErrorStack()
  
   and rewrote flushErrorState like
  
   void FlushErrorState(void)
   {
 /* reset ErrorStack is enough */
 resetErrorStack();
   }
  
   ???
  
  
    Nope. Rather than that, I would still prefer a direct call to
    FlushErrorState().
  
  
  
    but I can live with a direct call to FlushErrorState() - it is only a
    minor change.
  
  
3) I was thinking about the usability of the feature and wondering
whether we could add some header to the stack, so the user can
differentiate between the STACK and a normal NOTICE message?
   
Something like:
   
postgres=# select outer_outer_func(10);
NOTICE:  - Call Stack -
PL/pgSQL function inner_func(integer) line 4 at GET DIAGNOSTICS
PL/pgSQL function outer_func(integer) line 3 at RETURN
PL/pgSQL function outer_outer_func(integer) line 3 at RETURN
CONTEXT:  PL/pgSQL function outer_func(integer) line 3 at RETURN
PL/pgSQL function outer_outer_func(integer) line 3 at RETURN
 outer_outer_func
--
   20
(1 row)
  
    I understand, but I don't think it is a good idea. Sometimes you want
    to use the context for purposes other than a debug log - for example,
    to identify the topmost call or a nearby call - and any additional
    lines make parsing a little bit more difficult. You can add this line
    yourself easily (if you want)
  
  

Re: [HACKERS] Kudos for Reviewers -- straw poll

2013-06-26 Thread Claudio Freire
On Wed, Jun 26, 2013 at 10:25 AM, Andrew Dunstan and...@dunslane.net wrote:

 On 06/26/2013 09:14 AM, Bruce Momjian wrote:

 On Wed, Jun 26, 2013 at 10:40:17AM +1000, Brendan Jurd wrote:

 On 26 June 2013 03:17, Josh Berkus j...@agliodbs.com wrote:

 How should reviewers get credited in the release notes?

 a) not at all
 b) in a single block titled Reviewers for this version at the bottom.
 c) on the patch they reviewed, for each patch

 A weak preference for (c), with (b) running a close second.  As others
 have suggested, a review that leads to significant committable changes
 to the patch should bump the credit to co-authorship.

 As a reminder, I tried a variant of C for 9.2 beta release notes, and
 got lots of complaints, particularly because the line describing the
 feature now had many more names on it.

 In my opinion, adding reviewer names to each feature item might result
 in the removal of all names from features.

 A poll is nice for gauging interest, but many people who vote don't
 understand the ramifications of what they are voting on.



 That's why I voted for b :-)

Yeah, with that in mind, I'd also switch to b.

I wouldn't complain, but if it's been tried and failed... what can I say?


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PQConnectPoll, connect(2), EWOULDBLOCK and somaxconn

2013-06-26 Thread Andres Freund
On 2013-06-26 12:07:54 -0400, Tom Lane wrote:
 Andres Freund and...@2ndquadrant.com writes:
  On 2013-06-17 16:16:22 +0200, Andres Freund wrote:
  Not accepting EWOULDBLOCK in the above if() results in:
  could not connect to server: Resource temporarily unavailable
  Is the server running locally and accepting
  connections on Unix domain socket /tmp/.s.PGSQL.5440?
  which makes more sense.
 
  Could I convince a committer to NACK or commit  backpatch that patch?
 
 Some trawling in the commit history shows that the current logic dates
 from my commit 6d0d838cebdf2bcd5c03f5449a1888f1e120496f, which unified
 Windows and non-Windows code paths; the check for EWOULDBLOCK was added
 in the earlier commit ca5a51627919c6fb6ab5e23739615a674caa4037 which
 (claimed to) add support for non-blocking connect on Windows.  So I'm
 concerned that your patch could break that platform.

 A possible compromise is
 
   {
   if (SOCK_ERRNO == EINPROGRESS ||
 +#ifndef WIN32
   SOCK_ERRNO == EWOULDBLOCK ||
 +#endif
   SOCK_ERRNO == EINTR ||
   SOCK_ERRNO == 0)
 
 but I wonder whether it's safe to remove the case altogether.  Could
 anyone research the situation for non-blocking connect() on Windows?
 Perhaps on Windows we shouldn't test for EINPROGRESS at all?

Given the way EWOULDBLOCK is mentioned on MSDN
(http://msdn.microsoft.com/en-us/library/windows/desktop/ms737625%28v=vs.85%29.aspx)
it indeed seems to be required.
I don't see how we could trigger the conditions for EINPROGRESS on
Windows that MSDN lists, but since we need it on unixoid systems and it's
valid to treat the connect as partially successful on Windows, there
seems little benefit in dropping it. So I guess the #ifdef you suggested
is the way to go.
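
For reference, a stripped-down illustration of the pattern under
discussion, in plain POSIX C (this is not libpq code; the function name
and the socket-path handling are invented for the example). The idea is
that EINPROGRESS/EINTR mean "still connecting", while EWOULDBLOCK/EAGAIN
from a Unix-domain connect is a real, if transient, failure such as a full
listen queue - which is why the proposed #ifdef confines the EWOULDBLOCK
case to Windows:

#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/*
 * Kick off a non-blocking connect to a Unix-domain socket.  Returns true
 * if the connection succeeded or is still in progress (the caller should
 * then wait for writability and check SO_ERROR with getsockopt()), false
 * on a hard failure.  Illustration only, not libpq code.
 */
static bool
start_nonblocking_connect(const char *path, int *fd_out)
{
    struct sockaddr_un addr;
    int         fd = socket(AF_UNIX, SOCK_STREAM, 0);

    if (fd < 0)
        return false;
    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_NONBLOCK);

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);

    if (connect(fd, (struct sockaddr *) &addr, sizeof(addr)) == 0 ||
        errno == EINPROGRESS ||     /* still connecting: the normal case */
        errno == EINTR)             /* interrupted: just retry later */
    {
        *fd_out = fd;
        return true;
    }

    /*
     * EAGAIN/EWOULDBLOCK lands here on purpose: for a Unix-domain socket
     * it typically means the server's listen queue is full, i.e. a real
     * (transient) failure rather than "connection in progress".
     */
    close(fd);
    return false;
}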

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Kudos for Reviewers -- straw poll

2013-06-26 Thread Rodrigo Gonzalez
On Wed, 26 Jun 2013 09:14:07 -0400
Bruce Momjian br...@momjian.us wrote:

 On Wed, Jun 26, 2013 at 10:40:17AM +1000, Brendan Jurd wrote:
  On 26 June 2013 03:17, Josh Berkus j...@agliodbs.com wrote:
   How should reviewers get credited in the release notes?
  
   a) not at all
   b) in a single block titled Reviewers for this version at the
   bottom. c) on the patch they reviewed, for each patch
  
  A weak preference for (c), with (b) running a close second.  As
  others have suggested, a review that leads to significant
  committable changes to the patch should bump the credit to
  co-authorship.
 
 As a reminder, I tried a variant of C for 9.2 beta release notes, and
 got lots of complaints, particularly because the line describing the
 feature now had many more names on it.

I am just someone who is thinking that maybe he can review things... I
am not voting, but I have a comment about your last email.
If people (and with "people" I am not talking about myself but about
regular committers and reviewers) think that option c is good, I think
that we should change the tool (or the way) the release notes are
done. I mean (and excuse my poor English), if people think that is the
way to go, we should make the tools good enough for what people want,
not change people's minds because the tools are not good enough.


 
 In my opinion, adding reviewer names to each feature item might result
 in the removal of all names from features.

Let's fix the way that release notes are done


 
 A poll is nice for gauging interest, but many people who vote don't
 understand the ramifications of what they are voting on.
 

I agree, but cost-benefit is what we should look at here, not just cost.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Kudos for Reviewers -- straw poll

2013-06-26 Thread Bruce Momjian
On Wed, Jun 26, 2013 at 03:12:00PM -0300, Rodrigo Gonzalez wrote:
 On Wed, 26 Jun 2013 09:14:07 -0400
 Bruce Momjian br...@momjian.us wrote:
 
  On Wed, Jun 26, 2013 at 10:40:17AM +1000, Brendan Jurd wrote:
   On 26 June 2013 03:17, Josh Berkus j...@agliodbs.com wrote:
How should reviewers get credited in the release notes?
   
a) not at all
b) in a single block titled Reviewers for this version at the
bottom. c) on the patch they reviewed, for each patch
   
   A weak preference for (c), with (b) running a close second.  As
   others have suggested, a review that leads to significant
    committable changes to the patch should bump the credit to
   co-authorship.
  
  As a reminder, I tried a variant of C for 9.2 beta release notes, and
  got lots of complaints, particularly because the line describing the
  feature now had many more names on it.
 
 I am just someone who is thinking that maybe he can review things... I
 am not voting, but I have a comment about your last email.
 If people (and with "people" I am not talking about myself but about
 regular committers and reviewers) think that option c is good, I think
 that we should change the tool (or the way) the release notes are
 done. I mean (and excuse my poor English), if people think that is the
 way to go, we should make the tools good enough for what people want,
 not change people's minds because the tools are not good enough.

Production of the release notes was not the problem;  it was the text in
the release notes.  I don't see how we could modify the release note
format.

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash partitioning.

2013-06-26 Thread Claudio Freire
On Wed, Jun 26, 2013 at 11:14 AM, Bruce Momjian br...@momjian.us wrote:
 On Wed, Jun 26, 2013 at 05:10:00PM +0300, Heikki Linnakangas wrote:
 In practice, there might be a lot of quirks and inefficiencies and
 locking contention etc. involved in various DBMS's, that you might
 be able to work around with hash partitioning. But from a
 theoretical point of view, there is no reason to expect just
 partitioning a table on a hash to make key-value lookups any faster.

 Good analysis.  Has anyone benchmarked this to know our btree is
 efficient in this area?


Yep. I had at one point, and came to the same conclusion. I ended up
building a few partial indices, and have been happy ever since.
Granted, my DB isn't that big, just around 200G.

No, I don't have the benchmark results. It's been a while. Back then,
it was 8.3, so I did the partitioning in the application. It still
wasn't worth it.
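
Heikki's theoretical point is easy to see with rough numbers: hash
partitioning only shaves about log_f(number of partitions) levels off a
btree of fanout f, and the upper levels are normally cached anyway. A toy
back-of-the-envelope calculation (the fanout and row counts below are
made-up illustrative numbers, not measurements of PostgreSQL's nbtree):

#include <math.h>
#include <stdio.h>

int
main(void)
{
    double      rows = 1e9;         /* rows in the table, made up */
    double      fanout = 300;       /* keys per btree page, made up */
    int         parts = 64;         /* hash partitions, made up */

    /* approximate tree depths as ceil(log_fanout(rows)) */
    double      depth_single = ceil(log(rows) / log(fanout));
    double      depth_part = ceil(log(rows / parts) / log(fanout));

    printf("single index depth:  %.0f levels\n", depth_single);
    printf("per-partition depth: %.0f levels\n", depth_part);
    printf("saved per lookup:    %.0f level(s), and the upper levels\n"
           "are usually cached anyway\n", depth_single - depth_part);
    return 0;
}

With these numbers the saving is about one level per lookup, which is why
partitioning by hash rarely buys much for plain key-value lookups.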

Now I just have two indices: one that indexes only hot tuples, which is
very heavily queried and works blazingly fast, and one that indexes by
(hotness, key). I include the hotness value in the query, and it still
works fast enough. Luckily, things only become cold through an update
that marks them cold, so I can do that. I included hotness in the
index to cluster updates on the hot part of the index, but I could
have just used a regular index and paid a small price on the updates.
Indeed, for a while it worked without the hotness, and there was no
significant trouble. I later found out that WAL bandwidth decreased
noticeably when I added that hotness column, which helps a bit with
replication. It has worked ever since.

Today, I only use partitioning to split conceptually different
tables, so I can have different schemas for each table (and normalize
with a view). Now it's 9.2, so the view works quite nicely and
transparently. I have yet to find a use for hash partitioning.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Kudos for Reviewers -- straw poll

2013-06-26 Thread Rodrigo Gonzalez
On Wed, 26 Jun 2013 14:13:32 -0400
Bruce Momjian br...@momjian.us wrote:

 On Wed, Jun 26, 2013 at 03:12:00PM -0300, Rodrigo Gonzalez wrote:
  On Wed, 26 Jun 2013 09:14:07 -0400
  Bruce Momjian br...@momjian.us wrote:
  
   On Wed, Jun 26, 2013 at 10:40:17AM +1000, Brendan Jurd wrote:
On 26 June 2013 03:17, Josh Berkus j...@agliodbs.com wrote:
 How should reviewers get credited in the release notes?

 a) not at all
 b) in a single block titled Reviewers for this version at
 the bottom. c) on the patch they reviewed, for each patch

A weak preference for (c), with (b) running a close second.  As
others have suggested, a review that leads to significant
commitable changes to the patch should bump the credit to
co-authorship.
   
   As a reminder, I tried a variant of C for 9.2 beta release notes,
   and got lots of complaints, particularly because the line
   describing the feature now had many more names on it.
  
  I am just someone that is thinking that maybe can review things...I
  am not voting OK but I have a comment about your last email...
  If people thinks (and with people I am not talking about myself but
  regular committers and reviewers) think that option c is good, I
  think that we should change the tool (or the way) that release
  notes are doneI mean (and excuse my poor English) if people
  thing that it is the way to go, we should make tools good enough
  for what people want not change people thoughts cause tools are not
  good enough
 
 Production of the release notes was not the problem;  it was the text
 in the release notes.  I don't see how we could modify the release
 note format.
 

Well...

Checking release notes for 9.2.4

you have "Fix insecure parsing of server command-line switches
(Mitsumasa Kondo, Kyotaro Horiguchi)".

What about (if people think it is good) a second parenthetical with
"reviewed by someone"?


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bloom Filter lookup for hash joins

2013-06-26 Thread Jeff Janes
On Wed, Jun 26, 2013 at 5:01 AM, Atri Sharma atri.j...@gmail.com wrote:


  The problem here is that if the hash table is in memory, doing a hash
  table lookup directly is likely to be cheaper than a bloom filter
  lookup, even if the bloom filter fits into the processor cache and the
  hash table doesn't (10 last level cache hits is slower than one cache
  miss). Bloom filter will help when its not feasible to use an actual
  hash table (lots of large keys), the real lookup is slow (e.g. an
  index lookup), you are doing a lot of lookups to amortize the
  construction cost and the condition is expected to be selective (most
  lookups give a negative). There might be some data warehouse types of
  queries where it would help, but it seems like an awfully narrow use
  case with a potential for performance regressions when the planner has
  a bad estimate.

 Ok, sounds good. Can't we use bloom filters for the case where the hash
 table doesn't fit in memory? Specifically, when reading from disk is
 inevitable for accessing the hash table, we can use bloom filters for
 deciding which keys to actually read from disk.


I don't think that sounds all that promising.  When the hash table does not
fit in memory, it is either partitioned into multiple passes, each of which
do fit in memory, or it chooses a different plan altogether.  Do we know
under what conditions a Bloom filter would be superior to those options,
and could we reliably detect those conditions?
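
One crude way to reason about those conditions is to compare the expected
per-probe cost with and without the filter, using the textbook
false-positive estimate p ~ (1 - e^(-k*n/m))^k. The sketch below uses
invented cost numbers purely for illustration; they are not planner
constants and not measurements:

#include <math.h>
#include <stdio.h>

int
main(void)
{
    double      n = 1e7;            /* inner tuples inserted, made up */
    double      m = 8 * 16e6;       /* filter size in bits (16 MB), made up */
    double      k = 4;              /* hash functions, made up */

    /* classic Bloom false-positive estimate */
    double      p = pow(1.0 - exp(-k * n / m), k);

    double      sel = 0.05;         /* fraction of outer probes that match */
    double      c_filter = 1.0;     /* cost of one filter probe, arbitrary */
    double      c_lookup = 40.0;    /* cost of probing a spilled batch, arbitrary */

    /* with the filter, only matches and false positives pay c_lookup */
    double      cost_with = c_filter + (sel + (1.0 - sel) * p) * c_lookup;
    double      cost_without = c_lookup;

    printf("false positive rate ~ %.4f\n", p);
    printf("per-probe cost: with filter %.2f, without %.2f\n",
           cost_with, cost_without);
    return 0;
}

The filter only wins when the real lookup is expensive and most probes
miss; with a highly selective join or an in-memory hash table the extra
filter probe is pure overhead, which matches the concern about planner
misestimates above.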

Cheers,

Jeff


Re: [HACKERS] MD5 aggregate

2013-06-26 Thread Noah Misch
On Mon, Jun 17, 2013 at 11:34:52AM +0100, Dean Rasheed wrote:
 I've been playing around with the idea of an aggregate that computes
 the sum of the md5 hashes of each of its inputs, which I've called
 md5_total() for now, although I'm not particularly wedded to that
 name. Comparing it with md5_agg() on a 100M row table (see attached
 test script) produces interesting results:
 
 SELECT md5_agg(foo.*::text)
   FROM (SELECT * FROM foo ORDER BY id) foo;
 
  50bc42127fb9b028c9708248f835ed8f
 
 Time: 92960.021 ms
 
 SELECT md5_total(foo.*::text) FROM foo;
 
  02faea7fafee4d253fc94cfae031afc43c03479c
 
 Time: 96190.343 ms
 
 Unlike md5_agg(), it is no longer a true MD5 sum (for one thing, its
 result is longer) but it seems like it would be very useful for
 quickly comparing data in SQL, since its value is not dependent on the
 row-order making it easier to use and better performing if there is no
 usable index for ordering.
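
The order-independence described above follows from using a commutative,
associative combining step. As a toy illustration (this is not the
patch's md5_total() definition, which produces a longer result), one such
combiner adds 16-byte digests as 128-bit little-endian integers with
wrap-around:

#include <stdint.h>

/*
 * Add one 16-byte digest into a running 16-byte accumulator, treating
 * both as little-endian 128-bit integers (wrapping on overflow).  Because
 * addition is commutative and associative, the final accumulator value
 * does not depend on the order in which rows are processed.
 */
static void
digest_accumulate(uint8_t acc[16], const uint8_t digest[16])
{
    unsigned int carry = 0;

    for (int i = 0; i < 16; i++)
    {
        unsigned int sum = acc[i] + digest[i] + carry;

        acc[i] = (uint8_t) sum;
        carry = sum >> 8;
    }
}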

I took a look at this patch.  First, as Peter suggested upthread, it should
have been two patches: one to optimize calculateDigestFromBuffer() and
callees, another atop the first adding new SQL functions and adjusting the
infrastructure to meet their needs.

 + <primary>md5_total</primary>
 +</indexterm>
 +<function>md5_total(<replaceable class="parameter">expression</replaceable>)</function>
 +   </entry>
 +   <entry><type>text</type> or <type>bytea</type></entry>
 +   <entry><type>text</type></entry>
 +   <entry>
 +sum of the MD5 hash values of the input values, independent of their
 +order

This is not specific enough.  We would need to provide either an algorithm
specification or a literature reference.  However, that just leads to the
problem that we should not put a literature-unrecognized cryptographic
algorithm into core PostgreSQL.  I realize that the use case inspiring this
patch wasn't concerned with cryptographic properties, but the association with
md5 immediately makes it a cryptography offering.

md5_agg() is well-defined and not cryptographically novel, and your use case
is credible.  However, not every useful-sometimes function belongs in core; we
mostly stick to functions with ubiquitous value and functions that would be
onerous to implement externally.  (We do carry legacy stuff that wouldn't make
the cut today.)  In my estimation, md5_agg() does not make that cut.  The
variety of intuitive md5_agg() definitions attested upthread doesn't bode well
for its broad applicability.  It could just as well live in an extension
published on PGXN.  Mine is just one opinion, though; I won't be horrified if
the community wants an md5_agg() in core.

 *** a/src/backend/libpq/md5.c
 --- b/src/backend/libpq/md5.c
 ***
 *** 1,14 
   /*
*  md5.c
*
 !  *  Implements  the  MD5 Message-Digest Algorithm as specified in
 !  *  RFC  1321.  This  implementation  is a simple one, in that it
 !  *  needs  every  input  byte  to  be  buffered  before doing any
 !  *  calculations.  I  do  not  expect  this  file  to be used for
 !  *  general  purpose  MD5'ing  of large amounts of data, only for
 !  *  generating hashed passwords from limited input.

In other words, performance wasn't a design criterion.  I wonder if we would
be wasting our time with a narrow performance change like removing the data
copy.  What's your opinion of copying pgcrypto's md5 implementation, which
already avoids the data copy?

Thanks,
nm

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Kudos for Reviewers -- straw poll

2013-06-26 Thread Bruce Momjian
On Wed, Jun 26, 2013 at 03:22:06PM -0300, Rodrigo Gonzalez wrote:
 On Wed, 26 Jun 2013 14:13:32 -0400
  Production of the release notes was not the problem;  it was the text
  in the release notes.  I don't see how we could modify the release
  note format.
  
 
 Well...
 
 Checking release notes for 9.2.4
 
 you have Fix insecure parsing of server command-line switches
 (Mitsumasa Kondo, Kyotaro Horiguchi)
 
 What about (it people think that it is good) a second () with reviewed
 by someone

That's what we had, and people didn't like it.  If we overload that list
of names, we might find we want to remove all the names.

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bloom Filter lookup for hash joins

2013-06-26 Thread Atri Sharma
On Thu, Jun 27, 2013 at 12:01 AM, Jeff Janes jeff.ja...@gmail.com wrote:

 I don't think that sounds all that promising.  When the hash table does not
 fit in memory, it is either partitioned into multiple passes, each of which
 do fit in memory, or it chooses a different plan altogether.

Yeah, my point is, we could potentially be looking at not bringing in
all of the memory blocks for the hash tables. Even if they are
processed in parts, we still are looking at all of them.

Why not do a local bloom filter lookup, and never bring in the tuples
for which the bloom filter gives a negative? This could save us some
reads anyhow.

 Do we know
 under what conditions a Bloom filter would be superior to those options, and
 could we reliably detect those conditions?

Yes, this needs to be researched.

Regards,

Atri




--
Regards,

Atri
l'apprenant


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Add more regression tests for CREATE OPERATOR

2013-06-26 Thread Josh Berkus
On 06/26/2013 12:29 AM, Szymon Guz wrote:
 OK, so I think this patch can be committed, I will change the status.

Can we have a full review before you mark it ready for committer?  How
did you test it?  What kinds of review have you done?

The committer can't know whether it's ready or not if he doesn't have a
full report from you.

Thanks!

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Add more regression tests for CREATE OPERATOR

2013-06-26 Thread Szymon Guz
On 26 June 2013 20:55, Josh Berkus j...@agliodbs.com wrote:

 On 06/26/2013 12:29 AM, Szymon Guz wrote:
  OK, so I think this patch can be committed, I will change the status.

 Can we have a full review before you mark it ready for committer?  How
 did you test it?  What kinds of review have you done?

 The committer can't know whether it's ready or not if he doesn't have a
 full report from you.

 Thanks!


Hi Josh,
I will add more detailed descriptions to all patches I set as ready for
committer.

Szymon


Re: [HACKERS] Kudos for Reviewers -- straw poll

2013-06-26 Thread Alvaro Herrera
Bruce Momjian escribió:
 On Wed, Jun 26, 2013 at 03:22:06PM -0300, Rodrigo Gonzalez wrote:

  Checking release notes for 9.2.4
  
  you have Fix insecure parsing of server command-line switches
  (Mitsumasa Kondo, Kyotaro Horiguchi)
  
  What about (it people think that it is good) a second () with reviewed
  by someone
 
 That's what we had, and people didn't like it.  If we overload that list
 of names, we might find we want to remove all the names.

Yeah, it becomes too long.  (For security patches, in particular, it's
probably not wise to list reviewers; there might well be reviewers whose
input you never see because it happened on the closed security list).


See the entry for foreign key locks:

  Prevent non-key-field row updates from locking foreign key rows (Álvaro
  Herrera, Noah Misch, Andres Freund, Alexander Shulgin, Marti Raudsepp)

I am the author of most of the code, yet I chose to add Noah and Andres
because they contributed a huge amount of time to reviewing the patch,
and Alex and Marti because they submitted some code.  They are all
listed as coauthors, which seems a reasonable compromise to me.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


  1   2   >