Re: [HACKERS] Postgres code for a query intermediate dataset

2014-09-13 Thread Mark Kirkwood

On 14/09/14 05:36, Rohit Goyal wrote:

Hi All,

I want to work on the code of intermediate dataset of select and update
query.

For example.

Rohit's salary has been updated 4 times, so it has 4 different versions
of salary.

I want to select the salary of the person named Rohit. Now suppose that, in
the intermediate result, I find 4 different versions of the data. I want to
know which portion of the code I need to look at to work on all 4 versions
in the dataset. :)




Hi Rohit,

Currently in Postgres, these intermediate versions all exist - however a 
given session can only see one of them. Also VACUUM is allowed to 
destroy versions that no other transactions can see.


So if I'm understanding you correctly, you would like to have some way 
for a session to see *all* these versions (and, I guess, prevent VACUUM 
from destroying them).


It is certainly possible (or at least it used to be, via snapshot 
manipulation - I haven't looked at that code in a while, sorry) to enable 
a session to see all the old versions, and it is quite a cool idea 
(older versions of Postgres had this ability - it was called Time Travel).


For pure practicality, this can be achieved without any code 
modifications using TRIGGERs and an extra table (as Gavin alludes to).


Do tell us a bit more about what you are wanting to do!

Cheers

Mark



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Scaling shared buffer eviction

2014-09-13 Thread Amit Kapila
On Fri, Sep 12, 2014 at 11:55 AM, Amit Kapila 
wrote:
> On Thu, Sep 11, 2014 at 4:31 PM, Andres Freund 
wrote:
> > On 2014-09-10 12:17:34 +0530, Amit Kapila wrote:
> > > +++ b/src/backend/postmaster/bgreclaimer.c
> >
> > A fair number of comments in that file refer to bgwriter...
>
> Will fix.

Fixed.

> > > @@ -0,0 +1,302 @@
> > >
+/*-
> > > + *
> > > + * bgreclaimer.c
> > > + *
> > > + * The background reclaimer (bgreclaimer) is new as of Postgres 9.5.
 It
> > > + * attempts to keep regular backends from having to run clock sweep
(which
> > > + * they would only do when they don't find a usable shared buffer
from
> > > + * freelist to read in another page).
> >
> > That's not really accurate. Freelist pages are often also needed to
> > write new pages, without reading anything in.
>
> Agreed, but the same wording is used in the bgwriter file as well; so if we
> change it here, we might want to change the bgwriter file header as well.

I have changed only bgreclaimer for this comment.  If the same change
is required for bgwriter, I can create a separate patch; that
change is not related to this patch, so I thought it better
to keep it separate.

> > I'd phrase it as "which
> > they only need to do if they don't find a victim buffer from the
> > freelist"
>
> "victim buffer" sounds more like a buffer which it will get from the
> clock sweep; how about "next candidate"? (The same term is used in the
> function header of StrategyGetBuffer().)

Fixed.

> > >  In the best scenario all requests
> > > + * for shared buffers will be fulfilled from freelist as the
background
> > > + * reclaimer process always tries to maintain buffers on freelist.
 However,
> > > + * regular backends are still empowered to run clock sweep to find a
usable
> > > + * buffer if the bgreclaimer fails to maintain enough buffers on
freelist.
> >
> > "empowered" sounds strange to me. 'still can run the clock sweep'?
>
> No harm in changing it as you suggest; however, the same wording is used
> in the file header of bgwriter.c as well, so I think let's keep this usage
> as it is, because there is no correctness issue here.

I have not changed anything for this comment.



> >
> > > +static void bgreclaim_quickdie(SIGNAL_ARGS);
> > > +static void BgreclaimSigHupHandler(SIGNAL_ARGS);
> > > +static void ReqShutdownHandler(SIGNAL_ARGS);
> > > +static void bgreclaim_sigusr1_handler(SIGNAL_ARGS);
> >
> > This looks inconsistent.
>
> I have kept these based on bgwriter, so I am not sure if it's good to change
> them. However, if we want consistent naming, I would like to use
> something like:
>
> ReclaimShutdownHandler
> ReclaimQuickDieHandler
> ..
> ..

Changed function names to make them consistent.

> > > + /*
> > > +  * If an exception is encountered, processing resumes here.
> > > +  *
> > > +  * See notes in postgres.c about the design of this coding.
> > > +  */
> > > + if (sigsetjmp(local_sigjmp_buf, 1) != 0)
> > > + {
> > > + /* Since not using PG_TRY, must reset error stack by
hand */
> > > + error_context_stack = NULL;
> ..
> >
> > No LWLockReleaseAll(), AbortBufferIO(), ...? Unconvinced that that's a
> > good idea, regardless of it possibly being true today (which I'm not
> > sure about yet).
>
> I will add LWLockReleaseAll() in exception handling as discussed
> elsewhere in thread.

Done.

>
> > > +
> > > + /* Now we can allow interrupts again */
> > > + RESUME_INTERRUPTS();
> >
> > Other processes sleep for a second here, I think that's a good
> > idea. E.g. that bit:
>
> Agreed, will make change as per suggestion.

Done.

> >
> > > + /*
> > > +  * Loop forever
> > > +  */
> > > + for (;;)
> > > + {
> > > + int rc;
> > > +
> > > +
> > > + /*
> > > +  * Backend will signal bgreclaimer when the number of
buffers in
> > > +  * freelist falls below than low water mark of freelist.
> > > +  */
> > > + rc = WaitLatch(&MyProc->procLatch,
> > > +WL_LATCH_SET |
WL_POSTMASTER_DEATH,
> > > +-1);
> >
> > That's probably not going to work well directly after a (re)start of
> > bgreclaim (depending on how you handle the water mark, I'll see in a
> > bit).
>
> Could you please be more specific here?

I wasn't sure if any change is required here, so I kept the code
as it is.

> >
> > > +Background Reclaimer's Processing
> > > +-
> ..
> > > +Two water mark indicators are used to maintain sufficient number of
buffers
> > > +on freelist.  Low water mark indicator is used by backends to wake
bgreclaimer
> > > +when the number of buffers in freelist falls below it.  High water
mark
> > > +indicator is used by bgreclaimer to move buffers to freelist.
> >
> > For me the description of the high water as stated here doesn't seem to
> > explain anything.
> >
> >
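To make the intended interaction a bit more concrete, here is a toy,
self-contained C sketch of the two-water-mark idea described in the quoted
README text.  All names and numbers are invented for illustration only; the
real patch operates on the shared buffer freelist and wakes bgreclaimer via
its process latch.

#include <stdio.h>

#define LOW_WATER_MARK		100		/* invented value */
#define HIGH_WATER_MARK		500		/* invented value */

static int	freelist_len = 120;		/* pretend current freelist length */
static int	reclaimer_wakeups = 0;

/* Backend side: after taking a buffer, wake the reclaimer if we are low. */
static void
backend_takes_buffer(void)
{
	freelist_len--;
	if (freelist_len < LOW_WATER_MARK)
		reclaimer_wakeups++;		/* stands in for setting bgreclaimer's latch */
}

/* Reclaimer side: move buffers onto the freelist up to the high water mark. */
static void
reclaimer_cycle(void)
{
	while (freelist_len < HIGH_WATER_MARK)
		freelist_len++;				/* stands in for moving one victim buffer */
}

int
main(void)
{
	int			i;

	for (i = 0; i < 50; i++)
		backend_takes_buffer();
	printf("freelist=%d, wakeups=%d\n", freelist_len, reclaimer_wakeups);
	reclaimer_cycle();
	printf("after reclaim: freelist=%d\n", freelist_len);
	return 0;
}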

Re: [HACKERS] PoC: Partial sort

2014-09-13 Thread Peter Geoghegan
Some quick comments on partial-sort-basic-2.patch:

> *** a/src/include/utils/tuplesort.h
> --- b/src/include/utils/tuplesort.h
> ***
> *** 24,29 
> --- 24,30 
>   #include "executor/tuptable.h"
>   #include "fmgr.h"
>   #include "utils/relcache.h"
> + #include "utils/sortsupport.h"

Why include sortsupport.h here?

I would like to see more comments, especially within ExecSort(). The
structure of that routine is quite unclear.

I don't really like this MakeSortSupportKeys() stuff within ExecSort():

> !   /* Support structures for cmpSortSkipCols - already sorted columns */
> !   if (skipCols)
> !   node->skipKeys = MakeSortSupportKeys(skipCols,
> !plannode->sortColIdx,
> !plannode->sortOperators,
> !plannode->collations,
> !plannode->nullsFirst);
>
> +   /* Only pass on remaining columns that are unsorted */
> tuplesortstate = tuplesort_begin_heap(tupDesc,
> ! plannode->numCols - skipCols,
> ! 
> &(plannode->sortColIdx[skipCols]),
> ! 
> &(plannode->sortOperators[skipCols]),
> ! 
> &(plannode->collations[skipCols]),
> ! 
> &(plannode->nullsFirst[skipCols]),
>   work_mem,
>   node->randomAccess);

You're calling the new function MakeSortSupportKeys() (which
encapsulates setting up sortsupport state for all attributes) twice;
first, to populate the skip keys (the indexed attribute(s)), and
second, when tuplesort_begin_heap() is called, so that there is state
for unindexed sort groups that must be manually sorted. That doesn't
seem great.

I think we might be better off if a tuplesort function was called
shortly after tuplesort_begin_heap() is called. How top-n heap sorts
work is something that largely lives in tuplesort's head. Today, we
call tuplesort_set_bound() to hint to tuplesort "By the way, this is a
top-n heap sort applicable sort". I think that with this patch, we
should then hint (where applicable) "by the way, you won't actually be
required to sort those first n indexed attributes; rather, you can
expect to scan those in logical order. You may work the rest out
yourself, and may be clever about exploiting the sorted-ness of the
first few columns". The idea of managing a bunch of tiny sorts from
with ExecSort(), and calling the new function tuplesort_reset() seems
questionable. tuplesortstate is supposed to be private/opaque to
nodeSort.c, and the current design strains that.

I'd like to keep nodeSort.c simple. I think it's pretty clear that the
guts of this do not belong within ExecSort(), in any case. Also, the
additions there should be much better commented, wherever they finally
end up.

In this struct:

> *** a/src/include/nodes/execnodes.h
> --- b/src/include/nodes/execnodes.h
> *** typedef struct SortState
> *** 1670,1678 
> --- 1670,1682 
>  boolbounded;/* is the result set bounded? */
>  int64   bound;  /* if bounded, how many tuples are needed */
>  boolsort_Done;  /* sort completed yet? */
> +   boolfinished;   /* fetching tuples from outer node
> +  is finished ? */
>  boolbounded_Done;   /* value of bounded we did the sort with */
>  int64   bound_Done; /* value of bound we did the sort with */
>  void   *tuplesortstate; /* private state of tuplesort.c */
> +   SortSupport skipKeys;   /* columns already sorted in input */
> +   HeapTuple   prev;   /* previous tuple from outer node */
>   } SortState;

I think you should be clearer about the scope and duration of fields
like "finished", if this really belongs here. In general, there should
be some high-level comments about how the feature added by the patch
fits together, which there isn't right now.

That's all I have for now.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PoC: Partial sort

2014-09-13 Thread Peter Geoghegan
On Fri, Sep 12, 2014 at 2:19 PM, Alexander Korotkov
 wrote:
> Actually, higher cardinality skip columns are better. Sorting many smaller
> groups is faster than sorting fewer larger groups of the same total size.
> Also, with smaller groups you reach the limit more accurately (on average),
> i.e. you sort a smaller number of rows in total.

Higher cardinality leading key columns are better for what? Do you
mean they're better in that those cases are more sympathetic to your
patch, or better in the general sense that they'll perform better for
the user? The first example query that you chose to demonstrate your
patch had a leading, indexed column (column "v1") with much lower
cardinality/n_distinct than the column that had to be sorted on
(column "v2"). That was what my comments were based on.

When this feature is used, there will often be a significantly lower
cardinality in the leading, indexed column (as in your example).
Otherwise, the user might well have been happy to just order on the
indexed/leading column alone. That isn't definitely true, but it's
often true.

>> I'm not sure if that's worth it to more or less
>> duplicate heap_tuple_attr_equals() to save a "mere" n expensive
>> comparisons, but it's something to think about (actually, there are
>> probably less than even n comparisons in practice because there'll be
>> a limit).
>
> Not correct. Smaller groups are not OK. Imagine that two representations of
> the same skip column value exist. The index may return them in any order,
> even alternating between them. In this case sorting on the other column
> never takes place, while it should.

I think I explained this badly - it wouldn't be okay to make the
grouping smaller, but as you say we could tie-break with a proper
B-Tree support function 1 comparison to make sure we really had
reached the end of our grouping.

FWIW I want all bttextcmp()/varstr_cmp() comparisons to try a memcmp()
first, so the use of the equality operator with text in mind that you
mention may soon not be useful at all. The evidence suggests that
memcmp() is so much cheaper than special preparatory NUL-termination +
strcoll() that we should always try it first when sorting text, even
when we have no good reason to think it will work. The memcmp() is
virtually free. And so, you see why it might be worth thinking about
this when we already have reasonable confidence that many comparisons
will indicate that datums are equal. Other datatypes will have
expensive "real" comparators, not just text or numeric.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-09-13 Thread Claudio Freire
On Sat, Sep 13, 2014 at 10:27 PM, k...@rice.edu  wrote:
>> Here're the results running the test program on my i5-4200M
>>
>> crc sb8: 90444623
>> elapsed: 0.513688s
>> speed: 1.485220 GB/s
>>
>> crc hw: 90444623
>> elapsed: 0.048327s
>> speed: 15.786877 GB/s
>>
>> xxhash: 7f4a8d5
>> elapsed: 0.182100s
>> speed: 4.189663 GB/s
>>
>> The hardware version is insanely fast and works on the majority of Postgres
>> setups, and the fallback software implementation is 2.8x slower than the
>> fastest 32-bit hash around.
>>
>> Hopefully it'll be useful in the discussion.
>
> Thank you for running this sample benchmark. It definitely shows that the
> hardware version of the CRC is very fast, unfortunately it is really only
> available on x64 Intel/AMD processors which leaves all the rest lacking.
> For current 64-bit hardware, it might be instructive to also try using
> the XXH64 version and just take one half of the hash. It should come in
> at around 8.5 GB/s, or very nearly the speed of the hardware accelerated
> CRC. Also, while I understand that CRC has a very venerable history and
> is well studied for transmission type errors, I have been unable to find
> any research on its applicability to validating file/block writes to a
> disk drive. While it is, to quote you, "unbeaten collision wise", xxhash,
> in both its 32-bit and 64-bit versions, is its equal. Since there seems to
> be a lack of research on disk based error detection versus CRC polynomials,
> it seems likely that any of the proposed hash functions are on an equal
> footing in this regard. As Andres commented up-thread, xxhash comes along
> for "free" with lz4.


Bear in mind that

a) taking half of the CRC will invalidate all error detection
capability research, and it may also invalidate its properties,
depending on the CRC itself.

b) bit corruption, which is the kind of error CRCs target, is resurging
in SSDs, as can be seen in table 4 of a paper that I think appeared on
this same list:
https://www.usenix.org/system/files/conference/fast13/fast13-final80.pdf

I would totally forget about taking half of whatever CRC. That's looking
for pain, in that it will invalidate all existing and future research
on that hash/CRC type.
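
As an illustrative aside, here is a small, self-contained bitwise CRC-32C
in C (reflected Castagnoli polynomial 0x82F63B78 - far slower than the
slice-by-8 and hardware versions benchmarked above), showing the
single-bit error detection property being discussed:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Bit-at-a-time CRC-32C; illustration only, not a production implementation. */
static uint32_t
crc32c_bitwise(const unsigned char *buf, size_t len)
{
	uint32_t	crc = 0xFFFFFFFF;
	size_t		i;
	int			bit;

	for (i = 0; i < len; i++)
	{
		crc ^= buf[i];
		for (bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0x82F63B78 & (-(crc & 1)));
	}
	return crc ^ 0xFFFFFFFF;
}

int
main(void)
{
	unsigned char page[256];
	uint32_t	before,
				after;

	memset(page, 0xAB, sizeof(page));
	before = crc32c_bitwise(page, sizeof(page));
	page[100] ^= 0x04;			/* corrupt a single bit */
	after = crc32c_bitwise(page, sizeof(page));
	printf("before=%08x after=%08x detected=%d\n", before, after, before != after);
	return 0;
}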


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-09-13 Thread Arthur Silva
On Sat, Sep 13, 2014 at 10:27 PM, k...@rice.edu  wrote:

> On Sat, Sep 13, 2014 at 09:50:55PM -0300, Arthur Silva wrote:
> > On Sat, Sep 13, 2014 at 1:55 PM, Tom Lane  wrote:
> >
> > > Andres Freund  writes:
> > > > On 2014-09-13 08:52:33 +0300, Ants Aasma wrote:
> > > >> On Sat, Sep 13, 2014 at 6:59 AM, Arthur Silva 
> > > wrote:
> > > >>> That's not entirely true. CRC-32C beats pretty much everything with
> > > the same
> > > >>> length quality-wise and has both hardware implementations and
> highly
> > > >>> optimized software versions.
> > >
> > > >> For better or for worse CRC is biased by detecting all single bit
> > > >> errors, the detection capability of larger errors is slightly
> > > >> diminished. The quality of the other algorithms I mentioned is also
> > > >> very good, while producing uniformly varying output.
> > >
> > > > There's also much more literature about the various CRCs in
> comparison
> > > > to some of these hash algorithms.
> > >
> > > Indeed.  CRCs have well-understood properties for error detection.
> > > Have any of these new algorithms been analyzed even a hundredth as
> > > thoroughly?  No.  I'm unimpressed by evidence-free claims that
> > > something else is "also very good".
> > >
> > > Now, CRCs are designed for detecting the sorts of short burst errors
> > > that are (or were, back in the day) common on phone lines.  You could
> > > certainly make an argument that that's not the type of threat we face
> > > for PG data.  However, I've not seen anyone actually make such an
> > > argument, let alone demonstrate that some other algorithm would be
> better.
> > > To start with, you'd need to explain precisely what other error pattern
> > > is more important to defend against, and why.
> > >
> > > regards, tom lane
> > >
> >
> > MySQL went this way as well, changing the CRC polynomial in 5.6.
> >
> > What we are looking for here is uniqueness thus better error detection.
> Not
> > avalanche effect, nor cryptographically secure, nor bit distribution.
> > As far as I'm aware CRC32C is unbeaten collision wise and time proven.
> >
> > I couldn't find tests with xxhash and crc32 on the same hardware so I
> spent
> > some time putting together a benchmark (see attachment, to run it just
> > start run.sh)
> >
> > I included a crc32 implementation using SSE4.2 instructions (which works
> on
> > pretty much any Intel processor built after 2008 and AMD built after
> 2012),
> > a portable Slice-By-8 software implementation and xxhash since it's the
> > fastest software 32bit hash I know of.
> >
> > Here're the results running the test program on my i5-4200M
> >
> > crc sb8: 90444623
> > elapsed: 0.513688s
> > speed: 1.485220 GB/s
> >
> > crc hw: 90444623
> > elapsed: 0.048327s
> > speed: 15.786877 GB/s
> >
> > xxhash: 7f4a8d5
> > elapsed: 0.182100s
> > speed: 4.189663 GB/s
> >
> > The hardware version is insanely fast and works on the majority of Postgres
> > setups, and the fallback software implementation is 2.8x slower than the
> > fastest 32-bit hash around.
> >
> > Hopefully it'll be useful in the discussion.
>
> Thank you for running this sample benchmark. It definitely shows that the
> hardware version of the CRC is very fast, unfortunately it is really only
> available on x64 Intel/AMD processors which leaves all the rest lacking.
> For current 64-bit hardware, it might be instructive to also try using
> the XXH64 version and just take one half of the hash. It should come in
> at around 8.5 GB/s, or very nearly the speed of the hardware accelerated
> CRC. Also, while I understand that CRC has a very venerable history and
> is well studied for transmission type errors, I have been unable to find
> any research on its applicability to validating file/block writes to a
> disk drive. While it is, to quote you, "unbeaten collision wise", xxhash,
> in both its 32-bit and 64-bit versions, is its equal. Since there seems to
> be a lack of research on disk based error detection versus CRC polynomials,
> it seems likely that any of the proposed hash functions are on an equal
> footing in this regard. As Andres commented up-thread, xxhash comes along
> for "free" with lz4.
>
> Regards,
> Ken
>

For the sake of completeness the results for xxhash64 in my machine

xxhash64
speed:  7.365398 GB/s

Which is indeed very fast.
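
For anyone wanting to see what the hardware path looks like, a minimal
sketch of the kind of SSE4.2 loop the "crc hw" figure comes from (this is
not the benchmark program from the attachment; compile with -msse4.2, and
real code needs a runtime CPU check plus a software fallback):

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <nmmintrin.h>		/* SSE4.2 crc32 intrinsics */

/* CRC-32C over buf using the SSE4.2 crc32 instruction, 8 bytes at a time. */
static uint32_t
crc32c_hw(uint32_t crc, const unsigned char *buf, size_t len)
{
	while (len >= 8)
	{
		uint64_t	chunk;

		memcpy(&chunk, buf, 8);
		crc = (uint32_t) _mm_crc32_u64(crc, chunk);
		buf += 8;
		len -= 8;
	}
	while (len-- > 0)
		crc = _mm_crc32_u8(crc, *buf++);
	return crc;
}

int
main(void)
{
	const char	data[] = "full-page image goes here";

	printf("%08x\n",
		   crc32c_hw(0xFFFFFFFF, (const unsigned char *) data,
					 sizeof(data) - 1) ^ 0xFFFFFFFF);
	return 0;
}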


Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-09-13 Thread k...@rice.edu
On Sat, Sep 13, 2014 at 09:50:55PM -0300, Arthur Silva wrote:
> On Sat, Sep 13, 2014 at 1:55 PM, Tom Lane  wrote:
> 
> > Andres Freund  writes:
> > > On 2014-09-13 08:52:33 +0300, Ants Aasma wrote:
> > >> On Sat, Sep 13, 2014 at 6:59 AM, Arthur Silva 
> > wrote:
> > >>> That's not entirely true. CRC-32C beats pretty much everything with
> > the same
> > >>> length quality-wise and has both hardware implementations and highly
> > >>> optimized software versions.
> >
> > >> For better or for worse CRC is biased by detecting all single bit
> > >> errors, the detection capability of larger errors is slightly
> > >> diminished. The quality of the other algorithms I mentioned is also
> > >> very good, while producing uniformly varying output.
> >
> > > There's also much more literature about the various CRCs in comparison
> > > to some of these hash algorithms.
> >
> > Indeed.  CRCs have well-understood properties for error detection.
> > Have any of these new algorithms been analyzed even a hundredth as
> > thoroughly?  No.  I'm unimpressed by evidence-free claims that
> > something else is "also very good".
> >
> > Now, CRCs are designed for detecting the sorts of short burst errors
> > that are (or were, back in the day) common on phone lines.  You could
> > certainly make an argument that that's not the type of threat we face
> > for PG data.  However, I've not seen anyone actually make such an
> > argument, let alone demonstrate that some other algorithm would be better.
> > To start with, you'd need to explain precisely what other error pattern
> > is more important to defend against, and why.
> >
> > regards, tom lane
> >
> 
> MySQL went this way as well, changing the CRC polynomial in 5.6.
> 
> What we are looking for here is uniqueness thus better error detection. Not
> avalanche effect, nor cryptographically secure, nor bit distribution.
> As far as I'm aware CRC32C is unbeaten collision wise and time proven.
> 
> I couldn't find tests with xxhash and crc32 on the same hardware so I spent
> some time putting together a benchmark (see attachment, to run it just
> start run.sh)
> 
> I included a crc32 implementation using SSE4.2 instructions (which works on
> pretty much any Intel processor built after 2008 and AMD built after 2012),
> a portable Slice-By-8 software implementation and xxhash since it's the
> fastest software 32bit hash I know of.
> 
> Here're the results running the test program on my i5-4200M
> 
> crc sb8: 90444623
> elapsed: 0.513688s
> speed: 1.485220 GB/s
> 
> crc hw: 90444623
> elapsed: 0.048327s
> speed: 15.786877 GB/s
> 
> xxhash: 7f4a8d5
> elapsed: 0.182100s
> speed: 4.189663 GB/s
> 
> The hardware version is insanely fast and works on the majority of Postgres
> setups, and the fallback software implementation is 2.8x slower than the
> fastest 32-bit hash around.
> 
> Hopefully it'll be useful in the discussion.

Thank you for running this sample benchmark. It definitely shows that the
hardware version of the CRC is very fast, unfortunately it is really only
available on x64 Intel/AMD processors which leaves all the rest lacking.
For current 64-bit hardware, it might be instructive to also try using
the XXH64 version and just take one half of the hash. It should come in
at around 8.5 GB/s, or very nearly the speed of the hardware accelerated
CRC. Also, while I understand that CRC has a very venerable history and
is well studied for transmission type errors, I have been unable to find
any research on its applicability to validating file/block writes to a
disk drive. While it is, to quote you, "unbeaten collision wise", xxhash,
in both its 32-bit and 64-bit versions, is its equal. Since there seems to
be a lack of research on disk based error detection versus CRC polynomials,
it seems likely that any of the proposed hash functions are on an equal
footing in this regard. As Andres commented up-thread, xxhash comes along
for "free" with lz4.

Regards,
Ken


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Audit of logout

2014-09-13 Thread Tom Lane
Amit Kapila  writes:
> Yes, it was failing only on Windows.  Today, when I tried to look into the
> issue further, I found that if I rebuild plpgsql, it doesn't occur.
> So the conclusion is that it occurred due to a build mismatch; however, I
> am not sure why a rebuild of plpgsql is required for this patch.
> Sorry for the noise.

plpgsql contains references to PGC_SUSET and PGC_USERSET, whose values are
changed by this patch, so failure is unsurprising if you don't rebuild it.
(I saw failures in the regression tests even without Windows until I
updated plpgsql.)

Committed with minor cosmetic adjustments; mostly, rewriting some
comments.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] run xmllint during build (was Re: need xmllint on borka)

2014-09-13 Thread Peter Eisentraut
On 8/21/14 4:28 PM, Fabien COELHO wrote:
> I'm fine with "$(missing)" or whatever else that provides a clear message.

I've committed the $(missing) use separately, and rebased this patch on
top of that.

>From cb2787b2473186239eb23e7320654513df8e8fb7 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Sat, 13 Sep 2014 20:42:16 -0400
Subject: [PATCH v2] doc: Check DocBook XML validity during the build

Building the documentation with XSLT does not check the DTD, like a
DSSSL build would.  One can often get away with having invalid XML, but
the stylesheets might then create incorrect output, as they are not
designed to handle that.  Therefore, check the validity of the XML
against the DTD, using xmllint, during the build.

Add xmllint detection to configure, and add some documentation.

xmllint comes with libxml2, which is already in use, but it might be in
a separate package, such as libxml2-utils on Debian.
---
 configure  | 43 +++
 configure.in   |  1 +
 doc/src/sgml/Makefile  |  9 -
 doc/src/sgml/docguide.sgml | 19 +--
 src/Makefile.global.in |  1 +
 5 files changed, 70 insertions(+), 3 deletions(-)

diff --git a/configure b/configure
index 4918f95..873bcff 100755
--- a/configure
+++ b/configure
@@ -630,6 +630,7 @@ vpath_build
 PROVE
 OSX
 XSLTPROC
+XMLLINT
 DBTOEPUB
 COLLATEINDEX
 DOCBOOKSTYLE
@@ -14449,6 +14450,48 @@ fi
   test -n "$DBTOEPUB" && break
 done
 
+for ac_prog in xmllint
+do
+  # Extract the first word of "$ac_prog", so it can be a program name with args.
+set dummy $ac_prog; ac_word=$2
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for $ac_word" >&5
+$as_echo_n "checking for $ac_word... " >&6; }
+if ${ac_cv_prog_XMLLINT+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  if test -n "$XMLLINT"; then
+  ac_cv_prog_XMLLINT="$XMLLINT" # Let the user override the test.
+else
+as_save_IFS=$IFS; IFS=$PATH_SEPARATOR
+for as_dir in $PATH
+do
+  IFS=$as_save_IFS
+  test -z "$as_dir" && as_dir=.
+for ac_exec_ext in '' $ac_executable_extensions; do
+  if as_fn_executable_p "$as_dir/$ac_word$ac_exec_ext"; then
+ac_cv_prog_XMLLINT="$ac_prog"
+$as_echo "$as_me:${as_lineno-$LINENO}: found $as_dir/$ac_word$ac_exec_ext" >&5
+break 2
+  fi
+done
+  done
+IFS=$as_save_IFS
+
+fi
+fi
+XMLLINT=$ac_cv_prog_XMLLINT
+if test -n "$XMLLINT"; then
+  { $as_echo "$as_me:${as_lineno-$LINENO}: result: $XMLLINT" >&5
+$as_echo "$XMLLINT" >&6; }
+else
+  { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
+$as_echo "no" >&6; }
+fi
+
+
+  test -n "$XMLLINT" && break
+done
+
 for ac_prog in xsltproc
 do
   # Extract the first word of "$ac_prog", so it can be a program name with args.
diff --git a/configure.in b/configure.in
index 1392277..e3b17ea 100644
--- a/configure.in
+++ b/configure.in
@@ -1859,6 +1859,7 @@ PGAC_CHECK_DOCBOOK(4.2)
 PGAC_PATH_DOCBOOK_STYLESHEETS
 PGAC_PATH_COLLATEINDEX
 AC_CHECK_PROGS(DBTOEPUB, dbtoepub)
+AC_CHECK_PROGS(XMLLINT, xmllint)
 AC_CHECK_PROGS(XSLTPROC, xsltproc)
 AC_CHECK_PROGS(OSX, [osx sgml2xml sx])
 
diff --git a/doc/src/sgml/Makefile b/doc/src/sgml/Makefile
index 1d42be8..8bdd26c 100644
--- a/doc/src/sgml/Makefile
+++ b/doc/src/sgml/Makefile
@@ -44,6 +44,10 @@ ifndef OSX
 OSX = $(missing) osx
 endif
 
+ifndef XMLLINT
+XMLLINT = $(missing) xmllint
+endif
+
 ifndef XSLTPROC
 XSLTPROC = $(missing) xsltproc
 endif
@@ -78,6 +82,7 @@ override SPFLAGS += -wall -wno-unused-param -wno-empty -wfully-tagged
 man distprep-man: man-stamp
 
 man-stamp: stylesheet-man.xsl postgres.xml
+	$(XMLLINT) --noout --valid postgres.xml
 	$(XSLTPROC) $(XSLTPROCFLAGS) $(XSLTPROC_MAN_FLAGS) $^
 	touch $@
 
@@ -254,11 +259,13 @@ endif
 xslthtml: xslthtml-stamp
 
 xslthtml-stamp: stylesheet.xsl postgres.xml
+	$(XMLLINT) --noout --valid postgres.xml
 	$(XSLTPROC) $(XSLTPROCFLAGS) $(XSLTPROC_HTML_FLAGS) $^
 	cp $(srcdir)/stylesheet.css html/
 	touch $@
 
 htmlhelp: stylesheet-hh.xsl postgres.xml
+	$(XMLLINT) --noout --valid postgres.xml
 	$(XSLTPROC) $(XSLTPROCFLAGS) $^
 
 %-A4.fo.tmp: stylesheet-fo.xsl %.xml
@@ -268,7 +275,6 @@ htmlhelp: stylesheet-hh.xsl postgres.xml
 	$(XSLTPROC) $(XSLTPROCFLAGS) --stringparam paper.type USletter -o $@ $^
 
 FOP = fop
-XMLLINT = xmllint
 
 # reformat FO output so that locations of errors are easier to find
 %.fo: %.fo.tmp
@@ -281,6 +287,7 @@ XMLLINT = xmllint
 
 epub: postgres.epub
 postgres.epub: postgres.xml
+	$(XMLLINT) --noout --valid $<
 	$(DBTOEPUB) $<
 
 
diff --git a/doc/src/sgml/docguide.sgml b/doc/src/sgml/docguide.sgml
index 943ab5a..614a76c 100644
--- a/doc/src/sgml/docguide.sgml
+++ b/doc/src/sgml/docguide.sgml
@@ -149,6 +149,20 @@ Tool Sets
 
 
 
+ http://xmlsoft.org/";>Libxml2 for xmllint
+ 
+  
+   This library and the xmllint tool it contains are
+   used for processing XML.  Many developers will already
+   have Libxml2 installed, because it is also
+   used when building the PostgreSQL code.  Note, ho

Re: [HACKERS] jsonb contains behaviour weirdness

2014-09-13 Thread Peter Geoghegan
On Fri, Sep 12, 2014 at 6:40 AM, Alexander Korotkov
 wrote:
> It's likely that "JB_ROOT_COUNT(val) < JB_ROOT_COUNT(tmpl)" should be
> checked only for objects, not arrays. Also, should JsonbDeepContains do the
> same fast check when it deals with nested objects?

Attached patch implements something similar to what you describe here,
fixing your example.

I haven't added the optimization to JsonbDeepContains(). I think that
if anything, we should remove the optimization entirely, which is what
I've done -- an rhs "is it contained within?" value is hardly ever
going to be an object that has more pairs than the object we're
checking it is contained within. It's almost certainly going to have
far fewer pairs. Apart from only applying to objects, that
optimization just isn't an effective way of eliminating jsonb values
from consideration quickly. I'd rather not bother at all, rather than
having a complicated comment about why the optimization applies to
objects and not arrays.

-- 
Peter Geoghegan
diff --git a/src/backend/utils/adt/jsonb_op.c b/src/backend/utils/adt/jsonb_op.c
new file mode 100644
index 2d071b2..6fcdbad
*** a/src/backend/utils/adt/jsonb_op.c
--- b/src/backend/utils/adt/jsonb_op.c
*** jsonb_contains(PG_FUNCTION_ARGS)
*** 117,124 
  	JsonbIterator *it1,
  			   *it2;
  
! 	if (JB_ROOT_COUNT(val) < JB_ROOT_COUNT(tmpl) ||
! 		JB_ROOT_IS_OBJECT(val) != JB_ROOT_IS_OBJECT(tmpl))
  		PG_RETURN_BOOL(false);
  
  	it1 = JsonbIteratorInit(&val->root);
--- 117,123 
  	JsonbIterator *it1,
  			   *it2;
  
! 	if (JB_ROOT_IS_OBJECT(val) != JB_ROOT_IS_OBJECT(tmpl))
  		PG_RETURN_BOOL(false);
  
  	it1 = JsonbIteratorInit(&val->root);
*** jsonb_contained(PG_FUNCTION_ARGS)
*** 137,144 
  	JsonbIterator *it1,
  			   *it2;
  
! 	if (JB_ROOT_COUNT(val) < JB_ROOT_COUNT(tmpl) ||
! 		JB_ROOT_IS_OBJECT(val) != JB_ROOT_IS_OBJECT(tmpl))
  		PG_RETURN_BOOL(false);
  
  	it1 = JsonbIteratorInit(&val->root);
--- 136,142 
  	JsonbIterator *it1,
  			   *it2;
  
! 	if (JB_ROOT_IS_OBJECT(val) != JB_ROOT_IS_OBJECT(tmpl))
  		PG_RETURN_BOOL(false);
  
  	it1 = JsonbIteratorInit(&val->root);

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] A mechanism securing web applications in DBMS

2014-09-13 Thread Stephen Frost
Zhaomo,

* Zhaomo Yang (zhy...@cs.ucsd.edu) wrote:
> > Have you considered just using a regular, but unlogged, table?  That
> > would also avoid any risk that the application manages to drop or shadow
> > the temp table somehow with a "fake" table that changes who is currently
> > authenticated, and avoids having to figure out how to deal with the temp
> > table vanishing due to the connections going away.
> 
> So then all the currently logged in users will be stored in the same
> table, which means we also need to make sure that the correct row in
> that table is used when the row-level security policy refers to the
> current application-level user.

Yes - but that's pretty trivially done, given that you've stipulated that
a single DB connection must be used from authentication until
de-authentication.  All that is needed is an additional column in the
auth table which is populated with a pseudo-random value which is
guaranteed to be unique and constant for the duration of the
authenticated time - and the database backend PID is perfect for that.
The auth function can call the pg_backend_pid() function directly and
then the policies can include 'pid = pg_backend_pid()' as part of the
join to the auth table.  The auth function can also complain loudly if
an entry in the pid table is found with the current PID during auth (and
similarly, the de-auth function can complain if an entry with the current
PID is *not* found).  This would eliminate the need for the on-connect
triggers, I believe (though those are interesting for other reasons..).

> Let me send you a copy of our paper in a separate email which is a
> thorough description of the mechanism (including background, threat
> model, how it works, etc), which should give you an better idea on
> every aspect of the mechanism. Please do not distribute it because it
> has been accepted for publication. Note that the implementation we
> show in the paper is just a prototype (we made the changes so that we
> could implement it quickly). Our goal always is to integrate our
> mechanism into open source DBMS's like PG and MySQL cleanly.

It'd be very interesting to see this done with the unlogged table,
security definer functions, and the row-level policies patch which we're
working on.  I'd further suggest that the application also use multiple
roles which are set noinherit and 'set role' based on the operation
which it's currently being used for - this would add another level of
protection.  Using stored procedures (for more than just the auth and
de-auth functions as suggested here) can also be a good idea.

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Postgres code for a query intermediate dataset

2014-09-13 Thread Gavin Flower

On 14/09/14 06:35, Atri Sharma wrote:



On Sat, Sep 13, 2014 at 11:52 PM, David G Johnston 
<david.g.johns...@gmail.com> wrote:


Atri Sharma wrote
> On Sat, Sep 13, 2014 at 11:06 PM, Rohit Goyal <



Or rather even if you want to be able to reference the older
versions of
that record there is nothing in PostgreSQL to facilitate that. You
have to
manually create and manage the data so that you know during what
time period
a given record is valid.

David J.





Sometimes I do miss 'time travel' we used to have :)

Regards,

Atri
--
Regards,
Atri
/l'apprenant/
That is only because the Guild of Time Travellers was formed, and we are 
very selective in whom we allow to join.  It was a massive undertaking 
to purge the knowledge of effective time travel from the general 
populace (H. G. Wells had to be expelled with a partial brain wipe)!   :-)


On a more serious note:
I did design and implement a system to allow what the original poster 
was after; it involved 2 tables for each logical table, and used both an 
EFFECTIVE_DATE & an AS_AT_DATE.  This allowed insurance quotes to be 
valid for a given period of time, even if the insurance rates were set to 
change after the quote was given (but before the quote expired).  This was 
about 15 years ago. It was amusing that my wife joined that team 10 
years after I left, and found 2 of my original colleagues still there!



Cheers,
Gavin


Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-13 Thread Peter Geoghegan
On Fri, Sep 12, 2014 at 11:38 AM, Robert Haas  wrote:
> Based on discussion thus far it seems that there's a possibility that
> the trade-off may be different for short strings vs. long strings.  If
> the string is small enough to fit in the L1 CPU cache, then it may be
> that memcmp() followed by strcoll() is not much more expensive than
> strcoll().  That should be easy to figure out: write a standalone C
> program that creates a bunch of arbitrary, fairly-short strings, say
> 32 bytes, in a big array.  Make the strings different near the end,
> but not at the beginning.  Then have the program either do strcoll()
> on every pair (O(n^2)) or, with a #define, memcmp() followed by
> strcoll() on every pair.  It should be easy to see whether the
> memcmp() adds 1% or 25% or 100%.

I get where you're coming from now (I think). It seems like you're
interested in getting a sense of what it would cost to do an
opportunistic memcmp() in a universe where memory latency isn't by far
the dominant cost (which it clearly is, as evidenced by my most recent
benchmark where a variant of Heikki's highly unsympathetic SQL test
case showed a ~1% regression). You've described a case with totally
predictable access patterns, perfectly suited to prefetching, and with
no other memory access bottlenecks. As I've said, this seems
reasonable (at least with those caveats in mind).

The answer to your high level question appears to be: the
implementation (the totally opportunistic "memcmp() == 0" thing)
benefits from the fact that we're always bottlenecked on memory, and
to a fairly appreciable degree. I highly doubt that this is something
that can fail to work out with real SQL queries, but anything that
furthers our understanding of the optimization is a good thing. Of
course, within tuplesort we're not really getting the totally
opportunistic memcmp()s "for free" - rather, we're using a resource
that we would not otherwise be able to use at all.

This graph illustrates the historic trends of CPU and memory performance:

http://www.cs.virginia.edu/stream/stream_logo.gif

I find this imbalance quite remarkable - no wonder researchers are
finding ways to make the best of the situation. To make matters worse,
the per-core trends for memory bandwidth are now actually *negative
growth*. We may actually be going backwards, if we assume that that's
the bottleneck, and that we cannot parallelize things.

Anyway, attached rough test program implements what you outline. This
is for 30,000 32 byte strings (where just the final two bytes differ).
On my laptop, output looks like this (edited to only show median
duration in each case):

"""
Strings generated - beginning tests
(baseline) duration of comparisons without useless memcmp()s: 13.445506 seconds

duration of comparisons with useless memcmp()s: 17.720501 seconds
"""

It looks like about a 30% increase in run time when we always have
useless memcmps().

You can change the constants around easily - let's consider 64 KiB
strings now (by changing STRING_SIZE to 65536). In order to make the
program not take too long, I also reduce the number of strings
(N_STRINGS) from 30,000 to 1,000. If I do so, this is what I see as
output:

"""
Strings generated - beginning tests
(baseline) duration of comparisons without useless memcmp()s: 11.205683 seconds

duration of comparisons with useless memcmp()s: 14.542997 seconds
"""

It looks like the overhead here is surprisingly consistent with the
first run - again, about a 30% increase in run time.

As for 1MiB strings (this time, with an N_STRINGS of 300):

"""
Strings generated - beginning tests
(baseline) duration of comparisons without useless memcmp()s: 23.624183 seconds

duration of comparisons with useless memcmp()s: 35.332728 seconds
"""

So, at this scale, the overhead gets quite a bit worse, but the case
also becomes quite a bit less representative (if that's even
possible). I suspect that the call stack's stupidly large size may be
a problem here, but I didn't try and fix that.

Does this answer your question? Are you intent on extrapolating across
different platforms based on this test program, rather than looking at
real SQL queries? While a certain amount of that makes sense, I think
we should focus on queries that have some chance of being seen in real
production PostgreSQL instances. Failing that, actual SQL queries.

-- 
Peter Geoghegan
#include 
#include 
#include 
#include 
#include 

/* STRING_SIZE does not include NUL byte */
#define STRING_SIZE		32
#define N_STRINGS		30000
#define LAST_N_DIFFER	2

#define INSTR_TIME_SUBTRACT(x,y) \
	do { \
		(x).tv_sec -= (y).tv_sec; \
		(x).tv_usec -= (y).tv_usec; \
		/* Normalize */ \
		while ((x).tv_usec < 0) \
		{ \
			(x).tv_usec += 1000000; \
			(x).tv_sec--; \
		} \
	} while (0)

#define INSTR_TIME_GET_DOUBLE(t) \
	(((double) (t).tv_sec) + ((double) (t).tv_usec) / 1000000.0)

#define Min(x, y)		((x) < (y) ? (x) : (y))

/*
 * Generate single random alphabetical ASCII char.  Uses an unseeded ran

Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-09-13 Thread k...@rice.edu
On Sat, Sep 13, 2014 at 12:55:33PM -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2014-09-13 08:52:33 +0300, Ants Aasma wrote:
> >> On Sat, Sep 13, 2014 at 6:59 AM, Arthur Silva  wrote:
> >>> That's not entirely true. CRC-32C beats pretty much everything with the 
> >>> same
> >>> length quality-wise and has both hardware implementations and highly
> >>> optimized software versions.
> 
> >> For better or for worse CRC is biased by detecting all single bit
> >> errors, the detection capability of larger errors is slightly
> >> diminished. The quality of the other algorithms I mentioned is also
> >> very good, while producing uniformly varying output.
> 
> > There's also much more literature about the various CRCs in comparison
> > to some of these hash algorithms.
> 
> Indeed.  CRCs have well-understood properties for error detection.
> Have any of these new algorithms been analyzed even a hundredth as
> thoroughly?  No.  I'm unimpressed by evidence-free claims that
> something else is "also very good".
> 
> Now, CRCs are designed for detecting the sorts of short burst errors
> that are (or were, back in the day) common on phone lines.  You could
> certainly make an argument that that's not the type of threat we face
> for PG data.  However, I've not seen anyone actually make such an
> argument, let alone demonstrate that some other algorithm would be better.
> To start with, you'd need to explain precisely what other error pattern
> is more important to defend against, and why.
> 
>   regards, tom lane
> 

Here is a blog on the development of xxhash:

http://fastcompression.blogspot.com/2012/04/selecting-checksum-algorithm.html

Regards,
Ken


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Postgres code for a query intermediate dataset

2014-09-13 Thread Atri Sharma
On Sat, Sep 13, 2014 at 11:52 PM, David G Johnston <
david.g.johns...@gmail.com> wrote:

> Atri Sharma wrote
> > On Sat, Sep 13, 2014 at 11:06 PM, Rohit Goyal <
>
>
>
> Or rather even if you want to be able to reference the older versions of
> that record there is nothing in PostgreSQL to facilitate that. You have to
> manually create and manage the data so that you know during what time
> period
> a given record is valid.
>
> David J.
>
>
>
>
>
Sometimes I do miss 'time travel' we used to have :)

Regards,

Atri
-- 
Regards,

Atri
*l'apprenant*


Re: [HACKERS] Postgres code for a query intermediate dataset

2014-09-13 Thread David G Johnston
Atri Sharma wrote
> On Sat, Sep 13, 2014 at 11:06 PM, Rohit Goyal <

> rhtgyl.87@

> > wrote:
> 
>>
>> Hi All,
>>
>> I want to work on the code of intermediate dataset of select and update
>> query.
>>
>> For example.
>>
>> Rohit's salary has been updated 4 times, so it has 4 different versions of
>> salary.
>>
>> I want to select the salary of the person named Rohit. Now suppose that, in
>> the intermediate result, I find 4 different versions of the data. I want to
>> know which portion of the code I need to look at to work on all 4 versions
>> in the dataset. :)
>>
>> Thanks in advance!!
>>
>>
>>
> Not sure what you are looking for, but each update inserts a new
> tuple with the new values and marks the old tuple as deleted.
> 
> There is no tracking of the versions of any changes in the data set.
> Each update operation leaves only one visible tuple. If the transaction
> commits, the inserted tuple becomes visible and the old row is marked
> deleted. If the transaction rolls back, only the old tuple remains visible.
> -- 
> Regards,
> 
> Atri
> *l'apprenant*

Or rather even if you want to be able to reference the older versions of
that record there is nothing in PostgreSQL to facilitate that. You have to
manually create and manage the data so that you know during what time period
a given record is valid.

David J.




--
View this message in context: 
http://postgresql.1045698.n5.nabble.com/Postgres-code-for-a-query-intermediate-dataset-tp5818931p5818935.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Postgres code for a query intermediate dataset

2014-09-13 Thread Atri Sharma
On Sat, Sep 13, 2014 at 11:06 PM, Rohit Goyal  wrote:

>
> Hi All,
>
> I want to work on the code of intermediate dataset of select and update
> query.
>
> For example.
>
> Rohit's salary has been updated 4 times, so it has 4 different versions of
> salary.
>
> I want to select the salary of the person named Rohit. Now suppose that, in
> the intermediate result, I find 4 different versions of the data. I want to
> know which portion of the code I need to look at to work on all 4 versions in
> the dataset. :)
>
> Thanks in advance!!
>
>
>
Not sure what you are looking for, but each update inserts a new
tuple with the new values and marks the old tuple as deleted.

There is no tracking of the versions of any changes in the data set. Each
update operation leaves only one visible tuple. If the transaction commits,
the inserted tuple becomes visible and the old row is marked deleted. If the
transaction rolls back, only the old tuple remains visible.
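
To illustrate what "marked deleted" means at the tuple level, here is a
very rough, self-contained toy model in C of the xmin/xmax versioning
being described.  The real visibility rules live in
src/backend/utils/time/tqual.c (HeapTupleSatisfiesMVCC() and friends) and
handle many more cases - command ids, hint bits, aborted transactions, and
so on; the numbers below are made up.

#include <stdbool.h>
#include <stdio.h>

typedef unsigned int TransactionId;

/* Simplified tuple header: who inserted this version, who deleted/updated it. */
typedef struct
{
	TransactionId xmin;			/* inserting transaction */
	TransactionId xmax;			/* deleting/updating transaction, 0 if none */
} TupleVersion;

/* Toy snapshot: every transaction id below the horizon has already committed. */
typedef struct
{
	TransactionId horizon;
} Snapshot;

static bool
version_visible(TupleVersion t, Snapshot s)
{
	if (t.xmin >= s.horizon)
		return false;			/* inserter not committed yet for this snapshot */
	if (t.xmax != 0 && t.xmax < s.horizon)
		return false;			/* deleter already committed: version is dead */
	return true;
}

int
main(void)
{
	/* Four committed UPDATEs of Rohit's salary leave four chained versions. */
	TupleVersion versions[] = {
		{100, 200}, {200, 300}, {300, 400}, {400, 0}
	};
	Snapshot	snap = {450};
	int			i;

	for (i = 0; i < 4; i++)
		printf("version %d visible: %d\n", i,
			   (int) version_visible(versions[i], snap));
	return 0;
}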
-- 
Regards,

Atri
*l'apprenant*


[HACKERS] Postgres code for a query intermediate dataset

2014-09-13 Thread Rohit Goyal
Hi All,

I want to work on the code of intermediate dataset of select and update
query.

For example.

Rohit's salary has been updated 4 times, so it has 4 different versions of
salary.

I want to select the salary of the person named Rohit. Now suppose that, in
the intermediate result, I find 4 different versions of the data. I want to
know which portion of the code I need to look at to work on all 4 versions in
the dataset. :)

Thanks in advance!!

Regards,
Rohit Goyal



-- 
Regards,
Rohit Goyal


Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-09-13 Thread Tom Lane
Andres Freund  writes:
> On 2014-09-13 08:52:33 +0300, Ants Aasma wrote:
>> On Sat, Sep 13, 2014 at 6:59 AM, Arthur Silva  wrote:
>>> That's not entirely true. CRC-32C beats pretty much everything with the same
>>> length quality-wise and has both hardware implementations and highly
>>> optimized software versions.

>> For better or for worse CRC is biased by detecting all single bit
>> errors, the detection capability of larger errors is slightly
>> diminished. The quality of the other algorithms I mentioned is also
>> very good, while producing uniformly varying output.

> There's also much more literature about the various CRCs in comparison
> to some of these hash algorithms.

Indeed.  CRCs have well-understood properties for error detection.
Have any of these new algorithms been analyzed even a hundredth as
thoroughly?  No.  I'm unimpressed by evidence-free claims that
something else is "also very good".

Now, CRCs are designed for detecting the sorts of short burst errors
that are (or were, back in the day) common on phone lines.  You could
certainly make an argument that that's not the type of threat we face
for PG data.  However, I've not seen anyone actually make such an
argument, let alone demonstrate that some other algorithm would be better.
To start with, you'd need to explain precisely what other error pattern
is more important to defend against, and why.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-09-13 Thread Andres Freund
On 2014-09-13 08:52:33 +0300, Ants Aasma wrote:
> On Sat, Sep 13, 2014 at 6:59 AM, Arthur Silva  wrote:
> > That's not entirely true. CRC-32C beats pretty much everything with the same
> > length quality-wise and has both hardware implementations and highly
> > optimized software versions.
> 
> For better or for worse CRC is biased by detecting all single bit
> errors, the detection capability of larger errors is slightly
> diminished. The quality of the other algorithms I mentioned is also
> very good, while producing uniformly varying output.

There's also much more literature about the various CRCs in comparison
to some of these hash algorithms. Pretty much everything tests how well
they're suited for hashtables, but that's not really what we need
(although it might not hurt *at all* to have something faster there...).

I do think we need to think about the types of errors we really have to
detect. It's not all that clear that either the typical guarantees/tests
for CRCs or those for checksums (smhasher, whatever) are very
representative...

> CRC has exactly
> one hardware implementation in general purpose CPU's and Intel has a
> patent on the techniques they used to implement it. The fact that AMD
> hasn't yet implemented this instruction shows that this patent is
> non-trivial to work around.

I think AMD has implemented SSE4.2 with Bulldozer. It's still only recent
x86 though. So I think there's good reasons for moving away from it.

How one could get patents on exposing hardware CRC implementations -
hard to find a computing device without one - as an instruction is beyond
me...

I think it's pretty clear by now that we should move to lz4 for a couple
of things - and lz4 bundles xxhash with it. So that is one argument for it.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] formatting.c

2014-09-13 Thread Bruce Momjian
On Fri, Sep 12, 2014 at 11:48:03PM -0400, Tom Lane wrote:
> Bruce Momjian  writes:
> > Is everyone OK with me renaming some variables, structures, and macros? 
> > It will make back-patching harder, but will allow us to properly maintain
> > that file.
> 
> The back-patching problem could be addressed by back-patching the
> renaming.  If it's purely a mechanical thing, there should be minimal risk
> no?  I would claim that messed-up back-patches would have non-negligible
> risk too, so you can't say that not back-patching is clearly safer.

Uh, yes, it is mechanical and could be backpatched as well.

> Now, whether the renaming actually makes things any clearer is something
> I reserve judgment on.

What the code does now is call almost everything a "number", including
the format values, binary and string representations of the number, etc.
You can take a look at reverted commit
f68dc5d86b9f287f80f4417f5a24d876eb13771d to see an example of the
renaming;  that is probably only half the job.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [GSoC2014] Patch ALTER TABLE ... SET LOGGED

2014-09-13 Thread Fabrízio de Royes Mello
On Tue, Aug 26, 2014 at 1:42 AM, Fabrízio de Royes Mello <
fabriziome...@gmail.com> wrote:
>
>
> On Fri, Aug 22, 2014 at 5:23 PM, Alvaro Herrera 
wrote:
> >
> > Fabrízio de Royes Mello wrote:
> > > On Fri, Aug 22, 2014 at 4:45 PM, Alvaro Herrera <
alvhe...@2ndquadrant.com>
> > > wrote:
> >
> > > > I pointed out, in the email just before pushing the patch, that
perhaps
> > > > we should pass down the new relpersistence flag into
finish_heap_swap,
> > > > and from there we could pass it down to reindex_index() which is
where
> > > > it would be needed.  I'm not sure it's worth the trouble, but I
think we
> > > > can still ask Fabrizio to rework that part.
> >
> > > I can rework this part if it's a real concern.
> >
> > I guess we can make a better assessment by seeing what it would take.
> > I'm afraid it will turn out to be really ugly.
> >
> > Now, there's some desire to have unlogged indexes on logged tables; I
> > guess if we have that, then eventually there will also be a desire to
> > swap individual indexes from logged to unlogged.  Perhaps whatever fix
> > we come up with here would pave the road for that future feature.
> >
>
> Álvaro,
>
> I did a refactoring to pass down the relpersistence to "finish_heap_swap"
> and then change the catalog inside "reindex_index" instead of in the
> table rewrite phase.
>
> That way we can get rid of the function ATChangeIndexesPersistence in
> src/backend/commands/tablecmds.c.
>
> Thoughts?
>

Patch rebased and added to commitfest [1].

Regards,

[1] https://commitfest.postgresql.org/action/commitfest_view?id=24

--
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
>> Timbira: http://www.timbira.com.br
>> Blog: http://fabriziomello.github.io
>> Linkedin: http://br.linkedin.com/in/fabriziomello
>> Twitter: http://twitter.com/fabriziomello
>> Github: http://github.com/fabriziomello
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index ee10594..173f412 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1973,7 +1973,7 @@ index_build(Relation heapRelation,
 	 * created it, or truncated twice in a subsequent transaction, the
 	 * relfilenode won't change, and nothing needs to be done here.
 	 */
-	if (heapRelation->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
+	if (indexRelation->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
 		!smgrexists(indexRelation->rd_smgr, INIT_FORKNUM))
 	{
 		RegProcedure ambuildempty = indexRelation->rd_am->ambuildempty;
@@ -3099,7 +3099,7 @@ IndexGetRelation(Oid indexId, bool missing_ok)
  * reindex_index - This routine is used to recreate a single index
  */
 void
-reindex_index(Oid indexId, bool skip_constraint_checks)
+reindex_index(Oid indexId, bool skip_constraint_checks, char relpersistence)
 {
 	Relation	iRel,
 heapRelation;
@@ -3160,6 +3160,12 @@ reindex_index(Oid indexId, bool skip_constraint_checks)
 			indexInfo->ii_ExclusionStrats = NULL;
 		}
 
+		/*
+		 * Check if need to set the new relpersistence
+		 */
+		if (iRel->rd_rel->relpersistence != relpersistence)
+			iRel->rd_rel->relpersistence = relpersistence;
+
 		/* We'll build a new physical relation for the index */
 		RelationSetNewRelfilenode(iRel, InvalidTransactionId,
   InvalidMultiXactId);
@@ -3358,11 +3364,19 @@ reindex_relation(Oid relid, int flags)
 		foreach(indexId, indexIds)
 		{
 			Oid			indexOid = lfirst_oid(indexId);
+			char		relpersistence = rel->rd_rel->relpersistence;
 
 			if (is_pg_class)
 RelationSetIndexList(rel, doneIndexes, InvalidOid);
 
-			reindex_index(indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS));
+			/* Check for flags to force UNLOGGED or PERMANENT persistence */
+			if (flags & REINDEX_REL_FORCE_UNLOGGED)
+relpersistence = RELPERSISTENCE_UNLOGGED;
+			else if (flags & REINDEX_REL_FORCE_PERMANENT)
+relpersistence = RELPERSISTENCE_PERMANENT;
+
+			reindex_index(indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
+		  relpersistence);
 
 			CommandCounterIncrement();
 
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index ff80b09..f285026 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -588,7 +588,8 @@ rebuild_relation(Relation OldHeap, Oid indexOid, bool verbose)
 	 */
 	finish_heap_swap(tableOid, OIDNewHeap, is_system_catalog,
 	 swap_toast_by_content, false, true,
-	 frozenXid, cutoffMulti);
+	 frozenXid, cutoffMulti,
+	 OldHeap->rd_rel->relpersistence);
 }
 
 
@@ -1474,7 +1475,8 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
  bool check_constraints,
  bool is_internal,
  TransactionId frozenXid,
- MultiXactId cutoffMulti)
+ MultiXactId cutoffMulti,
+ char newrelpersistence)
 {
 	ObjectAddress object;
 	Oid			mapped_tables[4];
@@ -1518,6 +1520,12 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
 	reindex_flags = REINDEX_REL_SUPPRESS_INDEX_USE;
 	if (check_constraints)
 		reindex_flags |= REINDEX_REL_CHECK_CONSTRAINTS;
+
+	if (n

Re: [HACKERS] documentation update for doc/src/sgml/func.sgml

2014-09-13 Thread Fabien COELHO



Of course a general rule how to link to WP would be nice ...


I'm afraid that the current implicit rule is more or less "no links": there
are very few of them apart from those in the glossary, and when I submitted
docs with them they were removed before committing.


Ideally if external links were to be allowed PostgreSQL.org would maintain a
URL shortening service and those shortened links would be in the
documentation so that any necessary pointers could be changed immediately
and without source updates.


Why not. This would mean maintaining the redirection table somewhere. That 
would also be a place to check whether some links become dangling 
pointers. However there are also implications for the infrastructure needed 
to host these redirections. Maybe one page on wiki.postgresql.org could be 
used for this purpose.


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] documentation update for doc/src/sgml/func.sgml

2014-09-13 Thread David G Johnston
Fabien COELHO-3 wrote
>> Of course a general rule how to link to WP would be nice ...
> 
> I'm afraid that the current implicit rule is more or less "no links": there 
> are very few of them apart from those in the glossary, and when I submitted 
> docs with them they were removed before committing.

Ideally if external links were to be allowed PostgreSQL.org would maintain a
URL shortening service and those shortened links would be in the
documentation so that any necessary pointers could be changed immediately
and without source updates.

David J.




--
View this message in context: 
http://postgresql.1045698.n5.nabble.com/documentation-update-for-doc-src-sgml-func-sgml-tp5815621p5818924.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PoC: Partial sort

2014-09-13 Thread Alexander Korotkov
On Tue, Aug 19, 2014 at 2:02 PM, David Rowley  wrote:

> Here's a few notes from reading over the code:
>
> * pathkeys.c
>
>   EquivalenceMember *member = (EquivalenceMember *)
>  lfirst(list_head(key->pk_eclass->ec_members));
>
> You can use linitial() instead of lfirst(list_head()). The same thing
> occurs in costsize.c
>

Fixed.


> * pathkeys.c
>
> The following fragment:
>
> n = pathkeys_common(root->query_pathkeys, pathkeys);
>
> if (n != 0)
> {
> /* It's useful ... or at least the first N keys are */
>  return n;
> }
>
> return 0; /* path ordering not useful */
> }
>
> Could just read:
>
> /* return the number of path keys in common, or 0 if there are none */
>  return pathkeys_common(root->query_pathkeys, pathkeys);
>

Fixed.


> * execnodes.h
>
> In struct SortState, some new fields don't have a comment.
>

Fixed.


> I've also thrown a few different workloads at the patch and I'm very
> impressed with most of the results. Especially when LIMIT is used, however
> I've found a regression case which I thought I should highlight, but for
> now I can't quite see what could be done to fix it.
>
> create table a (x int not null, y int not null);
> insert into a select x.x,y.y from generate_series(1,100) x(x) cross
> join generate_series(1,10) y(y);
>
> Patched:
> explain analyze select x,y from a where x+0=1 order by x,y limit 10;
>  QUERY PLAN
>
> 
>  Limit  (cost=92.42..163.21 rows=10 width=8) (actual
> time=6239.426..6239.429 rows=10 loops=1)
>->  Partial sort  (cost=92.42..354064.37 rows=5 width=8) (actual
> time=6239.406..6239.407 rows=10 loops=1)
>  Sort Key: x, y
>  Presorted Key: x
>  Sort Method: quicksort  Memory: 25kB
>  ->  Index Scan using a_x_idx on a  (cost=0.44..353939.13
> rows=5 width=8) (actual time=0.059..6239.319 rows=10 loops=1)
>Filter: ((x + 0) = 1)
>Rows Removed by Filter: 990
>  Planning time: 0.212 ms
>  Execution time: 6239.505 ms
> (10 rows)
>
>
> Time: 6241.220 ms
>
> Unpatched:
> explain analyze select x,y from a where x+0=1 order by x,y limit 10;
>  QUERY PLAN
>
> 
>  Limit  (cost=195328.26..195328.28 rows=10 width=8) (actual
> time=3077.759..3077.761 rows=10 loops=1)
>->  Sort  (cost=195328.26..195453.26 rows=5 width=8) (actual
> time=3077.757..3077.757 rows=10 loops=1)
>  Sort Key: x, y
>  Sort Method: quicksort  Memory: 25kB
>  ->  Seq Scan on a  (cost=0.00..194247.77 rows=5 width=8)
> (actual time=0.018..3077.705 rows=10 loops=1)
>Filter: ((x + 0) = 1)
>Rows Removed by Filter: 990
>  Planning time: 0.510 ms
>  Execution time: 3077.837 ms
> (9 rows)
>
>
> Time: 3080.201 ms
>
> As you can see, the patched version performs an index scan in order to get
> the partially sorted results, but it does end up quite a bit slower than
> the seqscan/sort that the unpatched master performs. I'm not quite sure how
> realistic the x+0 = 1 WHERE clause is, but perhaps the same would happen if
> something like x+y = 1 was performed too. After a bit more analysis on
> this, I see that if I change the 50k estimate to 10 in the debugger that
> the num_groups is properly estimated at 1 and it then performs the seq scan
> instead. So it looks like the costings of the patch are not to blame here.
> (The 50k row estimate comes from rel tuples  / DEFAULT_NUM_DISTINCT)
>
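
A rough sketch of where that 50k figure comes from (not the planner's actual
code path; the table size below is inferred from the 50k estimate rather than
read off the plans above):

#include <stdio.h>

/* Fallback distinct-value estimate used when no statistics apply to an
 * expression such as "x + 0"; 200 matches DEFAULT_NUM_DISTINCT in
 * src/include/utils/selfuncs.h. */
#define DEFAULT_NUM_DISTINCT 200

int
main(void)
{
	double	reltuples = 10000000.0;	/* assumed: 50000 * DEFAULT_NUM_DISTINCT */
	double	rows_per_value = reltuples / DEFAULT_NUM_DISTINCT;

	printf("estimated rows for \"x+0 = 1\": %.0f\n", rows_per_value);	/* 50000 */
	return 0;
}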

Yes, the error comes from the assumed 50k row estimate. I've checked a
similar example where the estimate is fine.

create table b as (select x.x,y.y,x.x z from generate_series(1,100)
x(x) cross join generate_series(1,10) y(y));
create index b_x_idx on b(x);
analyze b;

There is a column z which is neither in the index nor in the "order by"
clause. If we replace "x+0=1" with "z=1", the optimizer doesn't choose to use
partial sort.

explain analyze select x,y,z from b where z=1 order by x,y limit 10;
QUERY PLAN
──
 Limit  (cost=179056.59..179056.61 rows=10 width=12) (actual
time=1072.498..1072.500 rows=10 loops=1)
   ->  Sort  (cost=179056.59..179056.63 rows=18 width=12) (actual
time=1072.495..1072.495 rows=10 loops=1)
 Sort Key: x, y
 Sort Method: quicksort  Memory: 25kB
 ->  Seq Scan on b  (cost=0.00..179056.21 rows=18 width=12) (actual
time=0.020..1072.454 rows=10 loops=1)
   Filter: (z = 1)
   Rows Removed by Filter: 990
 Planning time: 0.501 ms
 Execution time: 1072.555 ms
(9 rows)

If we even force the optimizer to 

Re: [HACKERS] documentation update for doc/src/sgml/func.sgml

2014-09-13 Thread Fabien COELHO



Attached is an updated version of the patch.


Ok.

I notice that you decided against adding tags around function and type 
names.



It's really not about the IEEE changing something, but about someone
changing the Wikipedia page. The way I linked it makes sure it always
displays the same version of the page.


I understood that. I think that the likelihood of someone removing the 
section about rounding in the IEEE standard is very low, and the current 
php link is pretty ugly from a REST perspective, and would prevent seeing 
possible improved versions of the page. Well, no big deal.



Of course a general rule how to link to WP would be nice ...


I'm afraid that the current implicit rule is more or less "no links": there 
are very few of them apart from those in the glossary, and when I submitted 
docs with them they were removed before committing.


However I do support the idea that these links are useful references and 
should be put where appropriate, even if it means that there must be some 
updates from time to time.


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench throttling latency limit

2014-09-13 Thread Fabien COELHO



[about logging...]


Here is an attempt at updating the log features, including the aggregate 
and sampling stuff, with skipped transactions under throttling.


I moved the logging stuff into a function which is called when a 
transaction is skipped or finished.


From a log file format perspective, I think that "-" would be better than 
"skipped".

--
Fabien.

diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 087e0d3..72f00d8 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -141,6 +141,18 @@ double		sample_rate = 0.0;
 int64		throttle_delay = 0;
 
 /*
+ * Transactions which take longer than this limit are counted as late
+ * and reported as such, although they are completed anyway.
+ *
+ * When under throttling: execution time slots which are more than
+ * this late (in us) are simply skipped, and the corresponding transaction
+ * is counted as such... it is not even started;
+ * otherwise above the limit transactions are counted as such, with the latency
+ * measured wrt the transaction schedule, not its actual start.
+ */
+int64		latency_limit = 0;
+
+/*
  * tablespace selection
  */
 char	   *tablespace = NULL;
@@ -238,6 +250,8 @@ typedef struct
 	int64		throttle_trigger;		/* previous/next throttling (us) */
 	int64		throttle_lag;	/* total transaction lag behind throttling */
 	int64		throttle_lag_max;		/* max transaction lag */
+	int64		throttle_latency_skipped; /* lagging transactions skipped */
+	int64		latency_late; /* late transactions */
 } TState;
 
 #define INVALID_THREAD		((pthread_t) 0)
@@ -250,6 +264,8 @@ typedef struct
 	int64		sqlats;
 	int64		throttle_lag;
 	int64		throttle_lag_max;
+	int64		throttle_latency_skipped;
+	int64		latency_late;
 } TResult;
 
 /*
@@ -295,6 +311,8 @@ typedef struct
 	double		max_lag;
 	double		sum_lag;		/* sum(lag) */
 	double		sum2_lag;		/* sum(lag*lag) */
+
+	int			skipped;		/* skipped transactions under --rate --limit */
 } AggVals;
 
 static Command **sql_files[MAX_FILES];	/* SQL script files */
@@ -372,6 +390,10 @@ usage(void)
 		 "  -f, --file=FILENAME  read transaction script from FILENAME\n"
 		   "  -j, --jobs=NUM   number of threads (default: 1)\n"
 		   "  -l, --logwrite transaction times to log file\n"
+		   "  -L, --limit=NUM  count transactions lasting more than NUM ms.\n"
+		   "   under throttling (--rate), transactions behind schedule\n"
+		   "   more than NUM ms are skipped, and those finishing more\n"
+		   "   than NUM ms after their scheduled start are counted.\n"
 		   "  -M, --protocol=simple|extended|prepared\n"
 		   "   protocol for submitting queries (default: simple)\n"
 		   "  -n, --no-vacuum  do not run VACUUM before tests\n"
@@ -991,13 +1013,13 @@ void
 agg_vals_init(AggVals *aggs, instr_time start)
 {
 	/* basic counters */
-	aggs->cnt = 0;/* number of transactions */
+	aggs->cnt = 0;/* number of transactions (some were possibly skipped) */
 	aggs->sum_latency = 0;		/* SUM(latency) */
 	aggs->sum2_latency = 0;/* SUM(latency*latency) */
 
 	/* min and max transaction duration */
-	aggs->min_latency = 0;
-	aggs->max_latency = 0;
+	aggs->min_latency = -1.0;
+	aggs->max_latency = -1.0;
 
 	/* schedule lag counters */
 	aggs->sum_lag = 0;
@@ -1005,10 +1027,173 @@ agg_vals_init(AggVals *aggs, instr_time start)
 	aggs->min_lag = 0;
 	aggs->max_lag = 0;
 
+	/* skipped transactions under --rate & --limit */
+	aggs->skipped = 0;
+
 	/* start of the current interval */
 	aggs->start_time = INSTR_TIME_GET_DOUBLE(start);
 }
 
+/* generate log
+ */
+static void doLog(TState *thread, CState *st, FILE *logfile, AggVals *agg, bool skipped)
+{
+	/*
+	 * write the log entry if this row belongs to the random sample,
+	 * or no sampling rate was given which means log everything.
+	 */
+	if (sample_rate == 0.0 ||
+		pg_erand48(thread->random_state) <= sample_rate)
+	{
+		double		lag;
+		double		latency;
+		instr_time	now;
+
+		INSTR_TIME_SET_CURRENT(now);
+
+		latency = skipped? -1.0: (double) (INSTR_TIME_GET_MICROSEC(now) - st->txn_scheduled);
+		lag = (double) (INSTR_TIME_GET_MICROSEC(st->txn_begin) - st->txn_scheduled);
+
+		/* should we aggregate the results or not? */
+		if (agg_interval > 0)
+		{
+			/*
+			 * are we still in the same interval? if yes, accumulate
+			 * the values (print them otherwise)
+			 */
+			if (agg->start_time + agg_interval >= INSTR_TIME_GET_DOUBLE(now))
+			{
+				agg->cnt += 1;
+				if (skipped)
+					agg->skipped += 1;
+				else
+				{
+					/* there is no latency to record if the transaction was skipped */
+					agg->sum_latency += latency;
+					agg->sum2_latency += latency * latency;
+
+					/* first non-skipped in this aggregation interval */
+					if ((agg->min_latency == -1.0) || (latency < agg->min_latency))
+						agg->min_latency = latency;
+
+					if ((agg->max_latency == -1.0) || (latency > agg->max_latency