[HACKERS] WAL log only necessary part of 2PC GID

2016-02-29 Thread Pavan Deolasee
Hello Hackers,

The maximum size of the GID, used as a 2PC identifier, is currently defined
as 200 bytes (see src/backend/access/transam/twophase.c). The actual GID
used by applications, though, may be much smaller than that. So IMO, instead
of WAL-logging the entire 200 bytes during PREPARE TRANSACTION, we should
just WAL-log strlen(gid) bytes.
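As a rough sketch of the idea (the +1 for a stored length/terminator byte is
an assumption for illustration, not the actual twophase.c record format):

```python
GIDSIZE = 200  # fixed-size GID buffer currently WAL-logged (twophase.c)

def gid_wal_bytes(gid: str, patched: bool) -> int:
    # Unpatched: the whole fixed-size buffer goes into the PREPARE record.
    # Patched: only the used portion of the GID (plus, we assume here, one
    # byte for a stored length/terminator) is logged.
    return (len(gid) + 1) if patched else GIDSIZE

# A pgbench-style GID such as '23.100' would shrink from 200 to 7 bytes.
saved = gid_wal_bytes("23.100", False) - gid_wal_bytes("23.100", True)
```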

The attached patch does that. The changes are limited to twophase.c, and
some simple crash recovery tests seem to work OK. In terms of performance,
a quick test shows a marginal improvement in tps using the script that Stas
Kelvich used for his work on speeding up two-phase transactions. The only
change I made is to keep :scale unchanged, because increasing :scale in
every iteration would result in only a handful of updates (not sure why
Stas had that in his original script)

\set naccounts 10 * :scale
\setrandom from_aid 1 :naccounts
\setrandom to_aid 1 :naccounts
\setrandom delta 1 100
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance - :delta WHERE aid =
:from_aid;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid =
:to_aid;
PREPARE TRANSACTION ':client_id.:scale';
COMMIT PREPARED ':client_id.:scale';

The amount of WAL generated during a 60s run shows a decline of about 25%
with default settings, except full_page_writes, which was turned off.

HEAD: 861 WAL bytes / transaction
PATCH: 670 WAL bytes / transaction

Actually, the above numbers probably include a lot of WAL generated because
of HOT pruning and page defragmentation. If we just look at the WAL
overhead caused by 2PC, the decline is somewhere close to 50%. I took
numbers using simple 1PC for reference and to understand the overhead of
2PC.

HEAD (1PC): 382 bytes / transaction
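For reference, the 2PC-only overhead implied by these numbers is simple
subtraction (figures taken from the runs above):

```python
head_2pc = 861    # WAL bytes/transaction on HEAD with 2PC
patch_2pc = 670   # with the patch
base_1pc = 382    # plain 1PC reference

# Subtracting the 1PC baseline isolates the WAL attributable to 2PC itself.
head_overhead = head_2pc - base_1pc    # 479 bytes/transaction
patch_overhead = patch_2pc - base_1pc  # 288 bytes/transaction
```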

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


reduce_gid_wal.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WAL log only necessary part of 2PC GID

2016-03-08 Thread Pavan Deolasee
On Fri, Mar 4, 2016 at 9:16 PM, Jesper Pedersen 
wrote:

> I can confirm the marginal speed up in tps due to the new WAL size.
>
> The TWOPHASE_MAGIC constant should be changed, as the file header has
> changed definition, right ?
>

Thanks for looking at it. I've revised the patch by incrementing the
TWOPHASE_MAGIC identifier.

Thanks,
Pavan



reduce_gid_wal_v2.patch
Description: Binary data



Re: [HACKERS] WAL log only necessary part of 2PC GID

2016-03-10 Thread Pavan Deolasee
On Wed, Mar 9, 2016 at 7:56 PM, Petr Jelinek  wrote:

> Hi,
>
> I wonder why you define the gidlen as uint32 when it would fit into uint8,
> which in the current TwoPhaseFileHeader struct should be a win of 8 bytes
> of padding (on 64-bit). I think that's something worth considering, given
> that this patch aims to lower the size of the data.
>
>
Hi Petr,

That sounds like a good idea; I hadn't thought about that. I would like to
make it uint16 though, just in case we decide to increase GIDSIZE from 200
to something more than 256 (Postgres-XL does that already). That still fits
in the same aligned width on both 32-bit and 64-bit machines. New version
attached.
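The padding effect can be illustrated with a toy layout (a hypothetical
two-field struct, not the actual TwoPhaseFileHeader): a 16-bit gidlen can
slot into a padding hole that a 32-bit field would widen.

```python
import ctypes

# Hypothetical header fragment: a 2-byte field followed by gidlen.
class WithU32(ctypes.Structure):
    _fields_ = [("flags", ctypes.c_uint16),
                ("gidlen", ctypes.c_uint32)]  # aligned to 4 -> 2 pad bytes

class WithU16(ctypes.Structure):
    _fields_ = [("flags", ctypes.c_uint16),
                ("gidlen", ctypes.c_uint16)]  # fits right after flags

size_u32 = ctypes.sizeof(WithU32)  # 8 on typical ABIs
size_u16 = ctypes.sizeof(WithU16)  # 4
```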

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


reduce_gid_wal_v3.patch
Description: Binary data



Re: [HACKERS] pg_xlogdump fails to handle WAL file with multi-page XLP_FIRST_IS_CONTRECORD data

2016-03-23 Thread Pavan Deolasee
On Wed, Mar 23, 2016 at 1:13 PM, Michael Paquier 
wrote:

>
> +   /*
> +* Compute targetRecOff. It should typically be greater than short
> +* page-header since a valid record can't , but can also be zero
> when
> +* caller has supplied a page-aligned address or when we are
> skipping
> +* multi-page continuation record. It doesn't matter though because
> +* ReadPageInternal() will read at least short page-header worth of
> +* data
> +*/
> This needs some polishing, there is an unfinished sentence here.
>
> +   targetRecOff = tmpRecPtr % XLOG_BLCKSZ;
> targetRecOff, pageHeaderSize and targetPagePtr could be declared
> inside directly the new while loop.
>

Thanks, Michael, for reviewing the patch. I've fixed these issues and the
new version is attached.

Thanks,
Pavan



xlogdump_multipage_cont_record_v2.patch
Description: Binary data



Re: [HACKERS] pg_xlogdump fails to handle WAL file with multi-page XLP_FIRST_IS_CONTRECORD data

2016-03-28 Thread Pavan Deolasee
On Wed, Mar 23, 2016 at 6:16 PM, Michael Paquier 
wrote:

>
>
> I'd just add dots at the end of the sentences in the comment blocks
> because that's project style, but I'm being picky, except that the
> logic looks quite good.
>

Since this is a bug affecting all stable branches, IMHO it would be a good
idea to fix this before the upcoming minor releases.

Thanks,
Pavan



Re: [HACKERS] pg_xlogdump fails to handle WAL file with multi-page XLP_FIRST_IS_CONTRECORD data

2016-03-30 Thread Pavan Deolasee
On Thu, Mar 31, 2016 at 6:27 AM, Craig Ringer  wrote:
>
>
> Can you describe the process used to generate the sample WAL segment?
>
>
A shame that I can't find the SQL file used to create the problematic WAL
segment. But this is what I did.

I wrote a plpgsql function which inserts rows in a loop, every time checking
whether the remaining space in the WAL segment has fallen below a couple of
hundred bytes. Of course, I used pg_settings to get the WAL segment size and
WAL page size, and pg_current_xlog_insert_location() to correctly compute
the remaining bytes in the WAL segment. At that point, do a non-HOT update,
preferably to a table which is already fully vacuumed and CHECKPOINTed, to
avoid getting any other WAL records in between. Assuming FPW is turned on,
the UPDATE should generate enough WAL to cross over the first page in the
new WAL segment.

Let me know if you would like me to put together an SQL script based on this
description.
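The remaining-bytes computation can be sketched like this (LSN parsing for
the 'hi/lo' text format; the 16MB segment size is the default and an
assumption here):

```python
WAL_SEG_SIZE = 16 * 1024 * 1024  # default WAL segment size

def lsn_to_bytes(lsn: str) -> int:
    # pg_current_xlog_insert_location() returns e.g. '0/16B3D50'
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def remaining_in_segment(lsn: str, seg_size: int = WAL_SEG_SIZE) -> int:
    return seg_size - (lsn_to_bytes(lsn) % seg_size)

# Keep inserting until this drops below a couple of hundred bytes,
# then issue the non-HOT UPDATE.
```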

Thanks,
Pavan



Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2016-08-31 Thread Pavan Deolasee
On Wed, Aug 31, 2016 at 10:38 PM, Claudio Freire 
wrote:

>
> On Wed, Aug 31, 2016 at 1:45 PM, Pavan Deolasee 
> wrote:
>
>> We discussed a few ideas to address the "Duplicate Scan" problem. For
>> example, we can teach Index AMs to discard any duplicate (key, CTID)
>> insert requests. Or we could guarantee uniqueness by only allowing updates
>> in one lexical order. While the former is a more complete solution to avoid
>> duplicate entries, searching through a large number of keys for non-unique
>> indexes could be a drag on performance. The latter approach may not be
>> sufficient for many workloads. Also, tracking increment/decrement for many
>> indexes will be non-trivial.
>>
>> There is another problem with allowing many index entries pointing to the
>> same WARM chain. It will be non-trivial to know how many index entries are
>> currently pointing to the WARM chain, and index/heap vacuum will throw up
>> more challenges.
>>
>> Instead, what I would like to propose, and what the patch currently
>> implements, is to restrict WARM update to once per chain. So the first
>> non-HOT update to a tuple or a HOT chain can be a WARM update. The chain
>> can further be HOT updated any number of times, but it cannot be further
>> WARM updated. This might look too restrictive, but it can still bring down
>> the number of regular updates by almost 50%. Further, if we devise a
>> strategy to convert a WARM chain back to a HOT chain, it can again be WARM
>> updated. (This part is currently not implemented.) A good side effect of
>> this simple strategy is that we know there can be at most two index
>> entries pointing to any given WARM chain.
>>
>
> We should probably think about coordinating with my btree patch.
>
> From the description above, the strategy is quite readily "upgradable" to
> one in which the indexam discards duplicate (key,ctid) pairs and that would
> remove the limitation of only one WARM update... right?
>

Yes, we should be able to add further optimisations along the lines you're
working on, but what I like about the current approach is that a) it reduces
the complexity of the patch, and b) having thought about cleaning up WARM
chains, limiting the number of index entries per root chain to a small
number will simplify that aspect too.
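The once-per-chain rule described above can be modelled as a small state
machine (names are illustrative, not the patch's actual code):

```python
def classify_update(index_keys_changed: bool, chain_already_warm: bool):
    """Return the update type and the chain's new WARM flag."""
    if not index_keys_changed:
        return "HOT", chain_already_warm       # HOT updates remain unlimited
    if not chain_already_warm:
        return "WARM", True                    # first index-key change: WARM
    return "REGULAR", chain_already_warm       # chain used its one WARM update
```

The first key-changing update is WARM, and a later one falls back to a
regular (cold) update, so a chain never accumulates more than two index
entries per index.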

Thanks,
Pavan



Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2016-09-01 Thread Pavan Deolasee
On Thu, Sep 1, 2016 at 1:33 AM, Bruce Momjian  wrote:

> On Wed, Aug 31, 2016 at 10:15:33PM +0530, Pavan Deolasee wrote:
> > Instead, what I would like to propose and the patch currently implements
> > is to restrict WARM update to once per chain. So the first non-HOT update
> > to a tuple or a HOT chain can be a WARM update. The chain can further be
> > HOT updated any number of times, but it cannot be further WARM updated.
> > This might look too restrictive, but it can still bring down the number
> > of regular updates by almost 50%. Further, if we devise a strategy to
> > convert a WARM chain back to a HOT chain, it can again be WARM updated.
> > (This part is currently not implemented.) A good side effect of this
> > simple strategy is that we know there can be at most two index entries
> > pointing to any given WARM chain.
>
> I like the simplified approach, as long as it doesn't block further
> improvements.
>
>
Yes, the proposed approach is simple, yet it does not stop us from improving
things further. Moreover, it has shown good performance characteristics, and
I believe it's a good first step.


>
> > Master:
> > tps = 1138.072117 (including connections establishing)
> >
> > WARM:
> > tps = 2016.812924 (including connections establishing)
>
> These are very impressive results.
>
>
Thanks. What's also interesting, and something the headline numbers don't
show, is that WARM TPS is as much as 3 times that of master when the
percentage of WARM updates is very high. Notice the spike in TPS in the
comparison graph.

Results with a non-default heap fill factor are even better. In both cases,
the improvement in TPS stays constant over long periods.


>
> >
> > During the first heap scan of VACUUM, we look for tuples with
> > HEAP_WARM_TUPLE set. If all live tuples in the chain are either marked
> > with the Blue flag or the Red flag (but no mix of Red and Blue), then the
> > chain is a candidate for HOT conversion.
>
> Uh, if the chain is all blue, then there are no WARM entries, so it is
> already a HOT chain, so there is nothing to do, right?
>

For aborted WARM updates, the heap chain may be all blue, but there may
still be a red index pointer, which must be cleared before we allow further
WARM updates to the chain.


>
> > We remember the root line pointer and the Red-Blue flag of the WARM
> > chain in a separate array.
> >
> > If we have a Red WARM chain, then our goal is to remove Blue pointers,
> > and vice versa. But there is a catch. For Index2 above, there is only a
> > Blue pointer, and that must not be removed. IOW, we should remove a Blue
> > pointer iff a Red pointer exists. Since index vacuum may visit Red and
> > Blue pointers in any order, I think we will need another index pass to
> > remove dead index pointers. So in the first index pass we check which
> > WARM candidates have 2 index pointers. In the second pass, we remove the
> > dead pointer and reset the Red flag if the surviving index pointer is
> > Red.
>
> Why not just remember the tid of chains converted from WARM to HOT, then
> use "amrecheck" on an index entry matching that tid to see if the index
> matches one of the entries in the chain.


That would require random access to the heap during the index vacuum phase,
something I would like to avoid. But we can have that as a fallback solution
for handling aborted vacuums.



> (It will match all of them or none of them, because they are all red.)  I
> don't see a point in coloring the index entries red, as later you would
> have to convert them to blue in the WARM-to-HOT conversion, and a vacuum
> crash could lead to inconsistencies.


Yes, that's a concern, since the conversion of red to blue will also need to
be WAL-logged to ensure that a crash doesn't leave us in an inconsistent
state. I still think this will be an overall improvement compared to
allowing one WARM update per chain.


> Consider that you can just call "amrecheck" on the few
> chains that have been converted from WARM to HOT.  I believe this is more
> crash-safe too.  However, if you have converted WARM to HOT in the heap,
> but crash during the index entry removal, you could potentially have
> duplicates in the index later, which is bad.
>
>
As you probably already noted, we clear the heap flags only after all
indexes are cleared of duplicate entries, and hence a crash in between
should not cause any correctness issue. As long as heap tuples are marked as
warm, amrecheck will ensure that only valid tuples are returned to the
caller.
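The Red/Blue pruning rule discussed above (remove a Blue pointer iff a Red
one exists for the same chain in the same index) can be sketched as follows
(an illustrative model, not the patch's code):

```python
def prune_chain_pointers(colors):
    """colors: the index-pointer colors for one WARM chain within one index.
    Returns the colors that survive the second vacuum pass."""
    if "red" in colors and "blue" in colors:
        return ["red"]   # Blue is dead only when a Red pointer also exists
    return list(colors)  # a lone pointer (e.g. Index2's Blue) must stay
```

After the pass, the surviving Red pointer's flag would be reset so the chain
can again be treated as HOT.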

Thanks,
Pavan



Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2016-09-05 Thread Pavan Deolasee
On Mon, Sep 5, 2016 at 3:18 AM, Tomas Vondra 
wrote:

> Hi,
>
> This thread started a year ago, different people contributed various
> patches, some of which already got committed. Can someone please post a
> summary of this thread, so that it's a bit more clear what needs
> review/testing, what are the main open questions and so on?
>
> I'm interested in doing some tests on the hardware I have available, but
> I'm not willing to spend my time untangling the discussion.
>
>
I signed up to review this patch. But as Amit explained later, there are two
different and independent implementations to solve the problem. Since Tomas
has volunteered to do some benchmarking, I guess I should wait for the
results, because they might influence which approach we choose.

Does that sound correct? Or do we already know which implementation is more
likely to be pursued, in which case I can start reviewing that patch?

Thanks,
Pavan



Re: [HACKERS] Refactoring of heapam code.

2016-09-05 Thread Pavan Deolasee
On Mon, Aug 8, 2016 at 3:13 PM, Anastasia Lubennikova <
a.lubennik...@postgrespro.ru> wrote:

>
>>
> Thank you for the review, I'll fix these problems in final version.
>
> Posting the first message I intended to raise the discussion. Patches
> were attached mainly to illustrate the problem and to be more concrete.
>

I started looking at the first patch posted above, but it seems you'll
rewrite these patches after the discussion on the heapam refactoring ideas
you posted at https://wiki.postgresql.org/wiki/HeapamRefactoring concludes.

Some comments anyway on the first patch:

1. It does not apply cleanly with git-apply - many whitespace errors.
2. I don't understand the difference between PageGetItemHeapHeaderOnly()
and PageGetItemHeap(). They seem to do exactly the same thing and can be
used interchangeably.
3. While I like the idea of using separate interfaces to get a heap/index
tuple from a page, maybe they can internally use a common interface instead
of duplicating what PageGetItem() already does.
4. Should we rename PageGetItemHeap() to PageGetHeapTuple() or something
like that?
5. If we do this, we should probably have corresponding wrapper
functions/macros for the remaining callers of PageGetItem().

I also looked at the refactoring design doc. It looks like a fine approach
to me, but the API will need much more elaborate discussion. I am not sure
the interfaces as presented are enough to support everything that even
heapam can do today. And much more thought will be required to ensure we
don't miss things that a new primary access method may need.

A few points regarding how the proposed API maps to heapam:

- How do we support parallel scans on the heap?
- Does the primary AM need to support locking of rows?
- Would there be a separate API to interpret the PAM tuple itself (something
that htup_details.h does today)? It seems natural that every PAM will have
its own interpretation of tuple structure, and we would like to hide that
inside the PAM implementation.
- Many upper modules need access to system attributes (xmin and xmax, for
starters). How do you plan to handle that? Do you think we can provide
enough abstraction so that the callers don't need to know the tuple
structures of individual PAMs?

I don't know what to do with the CF entry itself. I could change the status
to "waiting on author", but then the design draft may not get enough
attention. So I'm leaving it in the current state for others to look at.

Thanks,
Pavan



Re: [HACKERS] Notice lock waits

2016-09-06 Thread Pavan Deolasee
On Fri, Aug 5, 2016 at 10:30 PM, Jeff Janes  wrote:

>
>
> A general facility for promoting selected LOG messages to NOTICE would
> be nice, but I don't know how to design or implement that.  This is
> much easier, and I find it quite useful.
>
>
IMHO that's what we need, and it will benefit many more users than adding a
new GUC every time.

FWIW, I recently wrote a patch for Postgres-XL to do exactly this, and I
found it very useful, especially while debugging race conditions and
problems with ongoing sessions. Sorry, I don't mean to hijack this thread;
I will post that patch in a separate thread.

Thanks,
Pavan



[HACKERS] Override compile time log levels of specific messages/modules

2016-09-06 Thread Pavan Deolasee
While reviewing Jeff's notice_lock_waits patch, I came across his comment
about having a general facility for promoting selected LOG messages. So I
thought I should post this here, even though the patch is probably far from
being acceptable in Postgres.

I recently wrote a patch for Postgres-XL to do exactly this, and I found it
very useful, especially while debugging race conditions and problems with
ongoing sessions. I'm attaching a patch rebased on PG master. It works, but
TBH I don't think it's anywhere close to being acceptable in PG. But I hope
it is good enough to show how this can be done with minimal changes, and to
spark ideas.

The patch uses some preprocessing and scripting magic to assign distinct
identifiers to each module (a subdirectory in the source code), to each
file, and to each elog message. It then provides a set of functions by which
a user can increase/decrease/set log levels for either individual messages
or all messages within a source file or source module. The log levels can be
changed for specific backends only, or for all current or future backends.
If you configure with the --enable-genmsgids switch, MSGMODULES and MSGIDS
files are created in $srcdir, which can later be used to learn the ids
assigned to various modules/messages.

Now there are several problems with the patch:

- I don't know if it will work for anything other than clang and gcc (I've
tested those two, but only specific versions).
- The shared memory representation for msgs is not at all optimal, and with
a large value of max_connections it can quickly go out of control. But I
believe we can find a better representation without losing runtime
performance.
- The APIs themselves can be significantly improved.
- MSGMODULES and MSGIDS should probably be some sort of view that a user can
readily access without requiring access to the environment where it was
built.
- Right now it only supports changing levels for DEBUG[1-5] and LOG
messages. But that could easily be extended to cover INFO or NOTICE.
- The patch should probably have many more comments.
- Finally, I don't know if better and more elegant ways exist to
automatically assign module/file/msgids. I couldn't think of any without
making excessive changes to all source files, and hence did what I did. That
does not mean better ways don't exist.

Thanks,
Pavan



pg_msgids.patch
Description: Binary data



Re: [HACKERS] Override compile time log levels of specific messages/modules

2016-09-06 Thread Pavan Deolasee
On Tue, Sep 6, 2016 at 3:06 PM, Craig Ringer 
wrote:

>
> I think it's worth looking at how Java handles logging. We can't achieve
> an exact parallel in C as we don't really have a class hierarchy ... but we
> do have subsystems roughly grouped by file and directory structure.
>
Sure. In some large, popular enterprise software products I worked on many
years ago, we used to define modules within each source file, manually
assign distinct integer IDs to each message, and then provide various
utilities to turn on/off specific messages or ranges of messages within a
module.

> Being able to promote/demote these or selectively lower the log level
> threshold at a directory, file, function and line level would be
> exceedingly handy. Pretty much all of that can be magic'd up from macro
> output, so hopefully no postprocessing should be needed. (Though only gcc
> has __func__ or __FUNCTION__, I think, so we'd have selective support
> there.)
>
Well, __FUNCTION__ expands to a function name, and lookup based on string
identifiers will always be costly, especially if you want to sprinkle the
code with a lot of debug messages. Same with __FILE__. I believe we need a
unique integer constant to make it a fast O(1) lookup. I couldn't find any
other method to do that when I wrote the facility, and hence did what I did.
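The O(1) point can be made concrete: with compile-time integer message ids,
the per-elog check is a single array index (a sketch with assumed names and
level ordinals; the real patch keeps this state in shared memory):

```python
MAX_MSGIDS = 4096                  # assumed capacity
override = [0] * MAX_MSGIDS        # 0 = no override; else a forced elevel

def effective_elevel(msgid: int, compiled_elevel: int) -> int:
    # One array lookup per message -- no string hashing of __FILE__ names.
    forced = override[msgid]
    return forced if forced else compiled_elevel

LOG, NOTICE = 15, 18               # illustrative level ordinals only
override[42] = NOTICE              # promote message id 42 from LOG to NOTICE
```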

Thanks,
Pavan


Re: [HACKERS] Bug in two-phase transaction recovery

2016-09-08 Thread Pavan Deolasee
On Thu, Sep 8, 2016 at 12:13 PM, Michael Paquier 
wrote:

> On Wed, Sep 7, 2016 at 10:48 PM, Stas Kelvich 
> wrote:
> > Some time ago the two-phase state file format was changed to have a
> > variable-size GID, but several places that read those files were not
> > updated to use the new offsets. The problem exists in master and 9.6,
> > and can be reproduced on prepared transactions with savepoints.
>
> Oops and meh. This merits an open item, and had better be fixed by
> 9.6.0. I am glad you noticed that, honestly. And we had better take
> care of this issue as soon as possible.
>
>
Indeed, it's a bug. Thanks, Stas, for tracking it down, and Michael for the
review and for checking other places. I also looked at the patch and it
seems fine to me. FWIW, I looked at all the other places where
TwoPhaseFileHeader is referenced, and they look safe to me.

Thanks,
Pavan



Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-08 Thread Pavan Deolasee
On Wed, Sep 7, 2016 at 10:18 PM, Claudio Freire 
wrote:

> On Wed, Sep 7, 2016 at 12:12 PM, Greg Stark  wrote:
> > On Wed, Sep 7, 2016 at 1:45 PM, Simon Riggs 
> wrote:
> >> On 6 September 2016 at 19:59, Tom Lane  wrote:
> >>
> >>> The idea of looking to the stats to *guess* about how many tuples are
> >>> removable doesn't seem bad at all.  But imagining that that's going to
> be
> >>> exact is folly of the first magnitude.
> >>
> >> Yes.  Bear in mind I had already referred to allowing +10% to be safe,
> >> so I think we agree that a reasonably accurate, yet imprecise
> >> calculation is possible in most cases.
> >
> > That would all be well and good if it weren't trivial to do what
> > Robert suggested. This is just a large unsorted list that we need to
> > iterate throught. Just allocate chunks of a few megabytes and when
> > it's full allocate a new chunk and keep going. There's no need to get
> > tricky with estimates and resizing and whatever.
>
> I agree. While the idea of estimating the right size sounds promising
> a priori, considering the estimate can go wrong and over or
> underallocate quite severely, the risks outweigh the benefits when you
> consider the alternative of a dynamic allocation strategy.
>
> Unless the dynamic strategy has a bigger CPU impact than expected, I
> believe it's a superior approach.
>
>
How about a completely different representation for the TID array? This is
probably not a new idea, but I couldn't find the exact same idea discussed
before. I also think it's somewhat orthogonal to what we are trying to do
here, and will probably be a bigger change. But I thought I'd mention it
since we are on the topic.

What I propose is to use a simple bitmap to represent the tuples. If the
tuple at a given (block, offset) is dead, then the corresponding bit in the
bitmap is set. So clearly the searches through dead tuples are O(1)
operations, important for very large tables and large arrays.

The challenge really is that a heap page can theoretically have
MaxOffsetNumber line pointers (or, to be precise, the maximum possible
offset number). For an 8K block, that comes to about 2048. Having so many
bits per page is neither practical nor optimal. But in practice the largest
offset on a heap page should not be significantly greater than
MaxHeapTuplesPerPage, which is a more reasonable value of 291 on my machine.
Again, that's with zero-sized tuples; for real-life large tables, with much
wider tuples, the number may go down even further.

So we cap the offsets represented in the bitmap to some realistic value,
computed by looking at page density and then multiplying it by a small
factor (not more than two) to take into account LP_DEAD and LP_REDIRECT
line pointers. That should represent practically the majority of the dead
tuples in the table, but we then keep an overflow area to record tuples
beyond the per-page limit. The search routine does a direct lookup for
offsets below the limit, and searches the sorted overflow area for offsets
beyond it.
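A minimal sketch of this capped-bitmap-plus-overflow structure (the names
and the cap policy are illustrative, not an actual implementation):

```python
import bisect

class DeadTupleMap:
    def __init__(self, nblocks: int, max_offset: int):
        self.max_offset = max_offset   # per-page cap, e.g. ~2x avg density
        self.bitmap = bytearray((nblocks * max_offset + 7) // 8)
        self.overflow = []             # sorted (block, offset) beyond the cap

    def _bit(self, block: int, offset: int) -> int:
        return block * self.max_offset + (offset - 1)

    def record_dead(self, block: int, offset: int) -> None:
        if offset <= self.max_offset:
            b = self._bit(block, offset)
            self.bitmap[b // 8] |= 1 << (b % 8)
        else:
            bisect.insort(self.overflow, (block, offset))

    def is_dead(self, block: int, offset: int) -> bool:
        if offset <= self.max_offset:        # O(1) direct lookup
            b = self._bit(block, offset)
            return bool(self.bitmap[b // 8] & (1 << (b % 8)))
        # rare case: binary search in the sorted overflow area
        i = bisect.bisect_left(self.overflow, (block, offset))
        return i < len(self.overflow) and self.overflow[i] == (block, offset)
```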

For example, for a table with a 60-byte-wide tuple (including the 24-byte
header), each page can have approximately 8192/60 = 136 tuples. Say we
provision for 136*2 = 272 bits per page, i.e. 34 bytes per page for the
bitmap. The first 272 offsets in every page are represented in the bitmap,
and anything greater goes in the overflow region. On the other hand, the
current representation needs about 16 bytes per page assuming 2% dead
tuples, 40 bytes per page assuming 5% dead tuples, and 80 bytes assuming 10%
dead tuples. So the bitmap takes more space only for small tuples or when
vacuum is run very aggressively, both of which seem unlikely for very large
tables. Of course, the calculation does not take into account the space
needed by the overflow area, but I expect that to be small.
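The space comparison above can be checked with back-of-the-envelope
arithmetic (6 bytes per TID for the current array representation, the usual
size of an ItemPointerData):

```python
TID_BYTES = 6                       # sizeof(ItemPointerData)
tuples_per_page = 8192 // 60        # 136 tuples for a 60-byte-wide tuple
bitmap_bytes = 2 * tuples_per_page // 8   # 34 bytes/page at 2x provisioning

# Bytes/page used by the plain TID array at various dead-tuple fractions:
# roughly 16, 41 and 82 bytes at 2%, 5% and 10% dead tuples.
array_bytes = {pct: pct / 100 * tuples_per_page * TID_BYTES
               for pct in (2, 5, 10)}
```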

I guess we can choose between the two representations at the start by
looking at various table stats. We can also be smart and change from the
bitmap to the traditional representation as we scan the table, if we see
many more tuples in the overflow region than we provisioned for. There will
be some challenges in converting the representation mid-way, especially in
terms of memory allocation, but I think those can be sorted out if we think
the idea has merit.

Thanks,
Pavan



Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-08 Thread Pavan Deolasee
On Thu, Sep 8, 2016 at 8:42 PM, Claudio Freire 
wrote:

> On Thu, Sep 8, 2016 at 11:54 AM, Pavan Deolasee
>  wrote:
> > For example, for a table with 60 bytes wide tuple (including 24 byte
> > header), each page can approximately have 8192/60 = 136 tuples. Say we
> > provision for 136*2 = 272 bits per page i.e. 34 bytes per page for the
> > bitmap. First 272 offsets in every page are represented in the bitmap and
> > anything greater than are in overflow region. On the other hand, the
> > current representation will need about 16 bytes per page assuming 2% dead
> > tuples, 40 bytes per page assuming 5% dead tuples and 80 bytes assuming
> > 10% dead tuples. So bitmap will take more space for small tuples or when
> > vacuum is run very aggressively, both seems unlikely for very large
> > tables. Of course the calculation does not take into account the space
> > needed by the overflow area, but I expect that too be small.
>
> I thought about something like this, but it could be extremely
> inefficient for mostly-frozen tables, since the bitmap cannot account
> for frozen pages without losing the O(1) lookup characteristic.
>

Well, that's correct. But I thought the whole point is when there are a
large number of dead tuples, which requires a lot of memory. If my math
above is correct, then even at 5% dead tuples the bitmap representation
consumes approximately the same memory but provides O(1) search time.

Thanks,
Pavan



Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-08 Thread Pavan Deolasee
On Thu, Sep 8, 2016 at 11:40 PM, Masahiko Sawada 
wrote:

>
>
> Making the vacuum possible to choose between two data representations
> sounds good.
> I implemented the patch that changes dead tuple representation to bitmap
> before.
> I will measure the performance of bitmap representation again and post
> them.


Sounds great! I haven't seen your patch, but what I would suggest is to
compute page density D = (dead + live tuples)/relpages and experiment with
bitmaps of sizes from D to 2D bits per page. May I also suggest that instead
of putting effort into implementing the overflow area, you just count how
many dead TIDs would fall into the overflow area for a given choice of
bitmap size.

It might be a good idea to experiment with different vacuum scale factors,
varying from 2% to 20% (say 2, 5, 10, 20). You can probably run a longish
pgbench test on a large table and then save the data directory for repeated
experiments, although I'm not sure pgbench will be a good choice, because
HOT will prevent accumulation of dead line pointers, in which case you may
try adding another index on the abalance column.

It'll be worth measuring the memory consumption of both representations, as
well as the performance implications for index vacuum. I don't expect to see
any major difference in either heap scan.

Thanks,
Pavan



Re: [HACKERS] Block level parallel vacuum WIP

2016-09-10 Thread Pavan Deolasee
On Wed, Aug 24, 2016 at 3:31 AM, Michael Paquier 
wrote:

> On Tue, Aug 23, 2016 at 10:50 PM, Amit Kapila 
> wrote:
> > On Tue, Aug 23, 2016 at 6:11 PM, Michael Paquier
> >  wrote:
> >> On Tue, Aug 23, 2016 at 8:02 PM, Masahiko Sawada 
> wrote:
> >>> As for PoC, I implemented parallel vacuum so that each worker
> >>> processes both 1 and 2 phases for particular block range.
> >>> Suppose we vacuum 1000 blocks table with 4 workers, each worker
> >>> processes 250 consecutive blocks in phase 1 and then reclaims dead
> >>> tuples from heap and indexes (phase 2).
> >>
> >> So each worker is assigned a range of blocks, and processes them in
> >> parallel? This does not sound performance-wise. I recall Robert and
> >> Amit emails on the matter for sequential scan that this would suck
> >> performance out particularly for rotating disks.
> >>
> >
> > The implementation in patch is same as we have initially thought for
> > sequential scan, but turned out that it is not good way to do because
> > it can lead to inappropriate balance of work among workers.  Suppose
> > one worker is able to finish it's work, it won't be able to do more.
>
> Ah, so it was the reason. Thanks for confirming my doubts on what is
> proposed.
> --
>

I believe Sawada-san has got enough feedback on the design to work out the
next steps. It seems natural that the vacuum workers are assigned a portion
of the heap to scan and collect dead tuples (similar to what the patch does)
and that the same workers are responsible for the second phase of the heap
scan.

But as far as index scans are concerned, I agree with Tom that the best
strategy is to assign a different index to each worker process and let them
vacuum indexes in parallel. That way the work for each worker process is
clearly cut out and they don't contend for the same resources, which means
the first patch to allow multiple backends to wait for a cleanup buffer is
not required. Later we could extend it further so that multiple workers can
vacuum a single index by splitting the work on physical boundaries, but
even that will ensure a clear demarcation of work and hence no contention
on index blocks.

ISTM this will require further work and it probably makes sense to mark the
patch as "Returned with feedback" for now.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Refactoring of heapam code.

2016-09-12 Thread Pavan Deolasee
On Tue, Sep 6, 2016 at 8:39 PM, Anastasia Lubennikova <
a.lubennik...@postgrespro.ru> wrote:

> 06.09.2016 07:44, Pavan Deolasee:
>
> 2. I don't understand the difference between PageGetItemHeapHeaderOnly()
> and PageGetItemHeap(). They seem to do exactly the same thing and can be
> used interchangeably.
>
>
> The only difference between these two macros is that
> PageGetItemHeapHeaderOnly() doesn't touch any key fields of a tuple,
> it only checks header fields (see HeapTupleHeaderData). I divided it into
> two separate functions, while I was working on page compression and
> I wanted to avoid unnecessary decompression calls. These names are
> just a kind of 'markers' to make the code reading and improving easier.
>
>
Ok. I still don't get it, but that's probably because I haven't seen a real
use case for it. Right now, both macros look exactly the same.


> 3. While I like the idea of using separate interfaces to get heap/index
> tuple from a page, may be they can internally use a common interface
> instead of duplicating what PageGetItem() does already.
>
>
> I don't sure I get it right. Do you suggest to leave PageGetItem and write
> PageGetItemHeap() and PageGetItemIndex() as wrappers on it?
> It sounds reasonable while we have similar layout for both heap and index
> pages.
> In any case, it'll be easy to separate them when necessary.
>
>
Yes, that's what I was thinking.


> 4. Should we rename PageGetItemHeap() to PageGetHeapTuple() or something
> like that?
>
>
> I don't feel like doing that, because HeapTuple is a different structure.
> What we do get from page is a HeapTupleHeaderData structure
> followed by user's data.
>

Ok, makes sense.


>
>
> I also looked at the refactoring design doc. Looks like a fine approach to
> me, but the API will need much more elaborate discussions. I am not sure if
> the interfaces as presented are enough to support everything that even
> heapam can do today.
>
>
> What features of heapam do you think could be unsupportable in this API?
> Maybe I've just missed them.
>

I was thinking about locking, bulk reading (like the page-mode API) etc.
While you've added an API for vacuuming, we would probably also need an API
to collect dead tuples, do pruning etc. (not sure if that can be combined
with vacuum). Of course, these are probably very specific to the current
implementation of heap/MVCC, and not all storages will need them.


>
> I suggest refactoring, that will allow us to develop new heap-like access
> methods.
> For the first version, they must have methods to "heapify" tuple i.e turn
> internal
> representation of the data to regular HeapTuple, for example put some
> stubs into
> MVCC fields. Of course this approach has its disadvantages, such as
> performance issues.
> It definitely won't be enough to write column storage or to implement
> other great
> data structures. But it allows us not to depend of the Executor's code.
>
>
Ok, understood.


>
> - There are many upper modules that need access to system attributes
> (xmin, xmax for starters). How do you plan to handle that? You think we can
> provide enough abstraction so that the callers don't need to know the tuple
> structures of individual PAM?
>
>
> To be honest, I didn't thought about it.
> Do you mean external modules or upper levels of abstraction in Postgres?
>

I meant upper levels of abstraction like the executor. For example, while
inserting a new tuple, the executor (the index AM's insert routine to be
precise) may need to wait for another transaction to finish. Today, it can
easily get that information by looking at the xmax of the conflicting
tuple. How would we handle that with abstraction since not every PAM will
have a notion of xmax?
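Purely as an illustration of the abstraction question — none of these names
exist in PostgreSQL, they are invented for this sketch — a pluggable storage
API might have to expose something like a "what transaction blocks me" query
instead of letting the executor read xmax directly:

```python
# Hypothetical sketch of the abstraction problem: the executor asks the
# access method which xid it must wait on, rather than reading xmax itself.
# An AM with no xmax-like field would have to synthesize the answer.

from abc import ABC, abstractmethod
from typing import Optional

class StorageAM(ABC):
    @abstractmethod
    def blocking_xid(self, tuple_id) -> Optional[int]:
        """Return the xid the caller must wait on for this tuple, or None."""

class HeapAM(StorageAM):
    """A heap-like AM can answer straight from the tuple header's xmax."""
    def __init__(self, tuples):
        self.tuples = tuples          # tuple_id -> {"xmin": .., "xmax": ..}

    def blocking_xid(self, tuple_id):
        return self.tuples[tuple_id].get("xmax")

heap = HeapAM({1: {"xmin": 100, "xmax": 105}, 2: {"xmin": 100}})
print(heap.blocking_xid(1))   # 105 -> wait for this xid
print(heap.blocking_xid(2))   # None -> no conflicting updater
```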

Thanks,
Pavan

 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-13 Thread Pavan Deolasee
On Wed, Sep 14, 2016 at 12:21 AM, Robert Haas  wrote:

> On Fri, Sep 9, 2016 at 3:04 AM, Masahiko Sawada 
> wrote:
> > Attached PoC patch changes the representation of dead tuple locations
> > to the hashmap having tuple bitmap.
> > The one hashmap entry consists of the block number and the TID bitmap
> > of corresponding block, and the block number is the hash key of
> > hashmap.
> > Current implementation of this patch is not smart yet because each
> > hashmap entry allocates the tuple bitmap with fixed
> > size(LAZY_ALLOC_TUPLES), so each hashentry can store up to
> > LAZY_ALLOC_TUPLES(291 if block size is 8kB) tuples.
> > In case where one block can store only the several tens tuples, the
> > most bits are would be waste.
> >
> > After improved this patch as you suggested, I will measure performance
> benefit.
>
>
>
> Now, if I'm reading it correctly, this patch allocates a 132-byte
> structure for every page with at least one dead tuple.  In the worst
> case where there's just one dead tuple per page, that's a 20x
> regression in memory usage.  Actually, it's more like 40x, because
> AllocSetAlloc rounds small allocation sizes up to the next-higher
> power of two, which really stings for a 132-byte allocation, and then
> adds a 16-byte header to each chunk.  But even 20x is clearly not
> good.  There are going to be lots of real-world cases where this uses
> way more memory to track the same number of dead tuples, and I'm
> guessing that benchmarking is going to reveal that it's not faster,
> either.
>
>
Sawada-san offered to reimplement the patch based on what I proposed
upthread. In the new scheme of things, we will allocate a fixed-size bitmap
of length 2D bits per page, where D is the average page density of live +
dead tuples. (The rationale behind multiplying D by a factor of 2 is to
consider the worst-case scenario where every tuple also has an LP_REDIRECT
line pointer.) The value of D in most real-world, large tables should not
go much beyond, say, 100, assuming 80-byte-wide tuples and an 8K block
size. That translates to about 25 bytes/page. So all TIDs with offset less
than 2D can be represented by a single bit. We augment this with an
overflow area to track tuples which fall outside this limit. I believe this
area will be small, say 10% of the total allocation.

This representation is at least as good as the current one if there are at
least 4-5% dead tuples. I don't think very large tables will be vacuumed
with a scale factor less than that. And assuming 10% dead tuples, this
representation will actually be much more optimal.

The idea can fail when (a) there are very few dead tuples in the table, say
less than 5%, or (b) a large number of tuples fall outside the 2D limit.
While I don't expect either of these to hold for real-world, very large
tables, (a) can be anticipated when the vacuum starts, letting us use the
current representation, and (b) can be detected at run time, triggering a
one-time switch between representations. You may argue that managing two
representations is clumsy, which I agree with, but the code is completely
isolated and probably not more than a few hundred lines.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-14 Thread Pavan Deolasee
On Wed, Sep 14, 2016 at 8:47 AM, Pavan Deolasee 
wrote:

>
>>
> Sawada-san offered to reimplement the patch based on what I proposed
> upthread. In the new scheme of things, we will allocate a fixed-size bitmap
> of length 2D bits per page, where D is the average page density of live +
> dead tuples. (The rationale behind multiplying D by a factor of 2 is to
> consider the worst-case scenario where every tuple also has an LP_REDIRECT
> line pointer.) The value of D in most real-world, large tables should not
> go much beyond, say, 100, assuming 80-byte-wide tuples and an 8K block
> size. That translates to about 25 bytes/page. So all TIDs with offset less
> than 2D can be represented by a single bit. We augment this with an
> overflow area to track tuples which fall outside this limit. I believe
> this area will be small, say 10% of the total allocation.
>
>
So I cooked up the attached patch to track the number of live/dead tuples
found at each offset from 1 to MaxOffsetNumber. The idea was to see how many
tuples actually go beyond the threshold of 2D offsets per page. Note that I
am proposing to track 2D offsets via the bitmap and the rest via a regular
TID array.

So I ran a pgbench test for 2 hrs with scale factor 500. The autovacuum
scale factor was set to 0.1, or 10%.

Some interesting bits:

postgres=# select relname, n_tup_ins, n_tup_upd, n_tup_hot_upd, n_live_tup,
n_dead_tup, pg_relation_size(relid)/8192 as relsize,
(n_live_tup+n_dead_tup)/(pg_relation_size(relid)/8192) as density from
pg_stat_user_tables ;
 relname          | n_tup_ins | n_tup_upd | n_tup_hot_upd | n_live_tup | n_dead_tup | relsize | density
------------------+-----------+-----------+---------------+------------+------------+---------+---------
 pgbench_tellers  |      5000 |  95860289 |      87701578 |       5000 |          0 |    3493 |       1
 pgbench_branches |       500 |  95860289 |      94158081 |        967 |          0 |    1544 |       0
 pgbench_accounts |      5000 |  95860289 |      93062567 |   51911657 |    3617465 |  865635 |      64
 pgbench_history  |  95860289 |         0 |             0 |   95258548 |          0 |  610598 |     156
(4 rows)

The smaller tellers and branches tables bloat so much that the density, as
computed from live + dead tuples, falls close to 1 tuple/page. So for such
tables, the idea of 2D bits/page will fail miserably. But I think these
tables are the worst representatives, and I would be extremely surprised if
we ever find a very large table bloated so much. Even so, this probably
tells us that we can't rely solely on the density measure.

Another interesting bit about these small tables is that the largest used
offset for these tables never went beyond 291, which is the value of
MaxHeapTuplesPerPage. I don't know if there is something that prevents
inserting more than MaxHeapTuplesPerPage offsets per heap page, and I don't
know at this point if this gives us an upper limit for bits per page (maybe
it does).
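For what it's worth, the 291 figure can be reproduced from the page-layout
constants. A quick check of the arithmetic, assuming an 8K block and 64-bit
MAXALIGN (the constant names below are paraphrases of the PostgreSQL ones):

```python
# Reproduce MaxHeapTuplesPerPage for an 8K block. Constants mirror a usual
# 64-bit build: 24-byte page header, 4-byte line pointer, 23-byte heap
# tuple header MAXALIGNed to 24 bytes.

BLCKSZ = 8192
SIZE_OF_PAGE_HEADER = 24
ITEM_ID_DATA = 4
HEAP_TUPLE_HEADER = 23

def maxalign(n, alignment=8):
    """Round n up to the next multiple of the alignment."""
    return (n + alignment - 1) & ~(alignment - 1)

max_heap_tuples_per_page = (BLCKSZ - SIZE_OF_PAGE_HEADER) // (
    maxalign(HEAP_TUPLE_HEADER) + ITEM_ID_DATA)

print(max_heap_tuples_per_page)   # 291
```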

For the pgbench_accounts table, the maximum offset used was 121, though the
bulk of the used offsets were at the start of the page (see attached graph).
Now the test did not create enough dead tuples to trigger autovacuum on the
pgbench_accounts table, so I ran a manual vacuum at the end. (There were
about 5% dead tuples in the table by the time the test finished.)

postgres=# VACUUM VERBOSE pgbench_accounts ;
INFO:  vacuuming "public.pgbench_accounts"
INFO:  scanned index "pgbench_accounts_pkey" to remove 2797722 row versions
DETAIL:  CPU 0.00s/9.39u sec elapsed 9.39 sec.
INFO:  "pgbench_accounts": removed 2797722 row versions in 865399 pages
DETAIL:  CPU 0.10s/7.01u sec elapsed 7.11 sec.
INFO:  index "pgbench_accounts_pkey" now contains 5000 row versions in
137099 pages
DETAIL:  2797722 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  "pgbench_accounts": found 852487 removable, 5000 nonremovable
row versions in 865635 out of 865635 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 802256 unused item pointers.
Skipped 0 pages due to buffer pins.
0 pages are entirely empty.
CPU 0.73s/27.20u sec elapsed 27.98 sec.

tuple count at each offset (offnum:all_tuples:dead_tuples)

For 2797722 dead line pointers, the current representation would have used
2797722 x 6 = 16786332 bytes of memory. The most optimal bitmap would have
used 121 bits/page x 865399 pages = 13089159 bytes, whereas if we had
provisioned 2D bits/page, assuming D = 64 based on the above calculation,
we would have used 13846384 bytes of memory. This is about 18% less than
the current representation. Of course, we would have allocated some space
for the overflow region, which will make the difference smaller/negligible.
But the bitmaps would be extremely cheap to look up during index scans.
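A quick sanity check of the arithmetic above, using the numbers from the
vacuum run:

```python
# Check the memory figures quoted above for the pgbench_accounts vacuum.

dead_tids = 2797722
pages = 865399
max_offset = 121            # largest offset actually used
bits_per_page = 128         # provisioned 2D bits with D = 64

current = dead_tids * 6                     # 6-byte TIDs
optimal_bitmap = max_offset * pages // 8    # 121 bits per page
provisioned = bits_per_page * pages // 8    # 2D bits per page

print(current, optimal_bitmap, provisioned)
# 16786332 13089159 13846384
print(round(100 * (1 - provisioned / current)))  # 18 (% saved)
```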

Now maybe I got lucky, or maybe I did not run the tests long enough (though
I believe that may have work

Re: [HACKERS] Printing bitmap objects in the debugger

2016-09-14 Thread Pavan Deolasee
On Wed, Sep 14, 2016 at 3:43 PM, Ashutosh Bapat <
ashutosh.ba...@enterprisedb.com> wrote:

> Hi All,
> While working on partition-wise join, I had to examine Relids objects
> many times. Printing the Bitmapset::words[] in binary format and then
> inferring the relids takes time and is error prone. Instead I wrote a
> function bms_to_char() which allocates a StringInfo, calls
> outBitmapset() to decode Bitmapset as a set of integers and returns
> the string. In order to examine a Relids object all one has to do is
> execute 'p bms_to_char(bms_object) under gdb.
>
>
Can we not do this as gdb macros? My knowledge is rusty in this area, and
lately I'm using the LLVM debugger, lldb (which probably does not have
something equivalent), but I believe gdb allows you to write useful macros.
As a bonus, you can then use them even for inspecting core files.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Printing bitmap objects in the debugger

2016-09-14 Thread Pavan Deolasee
On Wed, Sep 14, 2016 at 3:46 PM, Pavan Deolasee 
wrote:

>
>
>  lately I'm using LVM debugger (which probably does not have something
> equivalent),
>

And I was so clueless about lldb's powerful scripting interface. For
example, you can write something like this in bms_utils.py:

import lldb

def print_bms_members(bms):
    words = bms.GetChildMemberWithName("words")
    nwords = int(bms.GetChildMemberWithName("nwords").GetValue())

    ret = 'nwords = {0} bitmap: '.format(nwords)
    for i in range(0, nwords):
        # index with i (not a constant 0) so multi-word bitmaps print fully
        ret += hex(int(words.GetChildAtIndex(i, lldb.eNoDynamicValues,
                                             True).GetValue()))

    return ret

And then do this while attached to lldb debugger:

Process 99659 stopped
* thread #1: tid = 0x59ba69, 0x0001090b012f
postgres`bms_add_member(a=0x7fe60a0351f8, x=10) + 15 at
bitmapset.c:673, queue = 'com.apple.main-thread', stop reason = breakpoint
1.1
frame #0: 0x0001090b012f
postgres`bms_add_member(a=0x7fe60a0351f8, x=10) + 15 at bitmapset.c:673
   670 int wordnum,
   671 bitnum;
   672
-> 673 if (x < 0)
   674 elog(ERROR, "negative bitmapset member not allowed");
   675 if (a == NULL)
   676 return bms_make_singleton(x);
(lldb) script
Python Interactive Interpreter. To exit, type 'quit()', 'exit()' or Ctrl-D.
>>> from bms_utils import *
>>> bms = lldb.frame.FindVariable ("a")
>>> print print_bms_members(bms)
nwords = 1 bitmap: 0x200


The complete API reference is available here
http://lldb.llvm.org/python_reference/index.html

Looks like an interesting SoC project to write useful lldb/gdb scripts to
print internal structures for ease of debugging :-)

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-14 Thread Pavan Deolasee
On Wed, Sep 14, 2016 at 5:32 PM, Robert Haas  wrote:

> On Wed, Sep 14, 2016 at 5:45 AM, Pavan Deolasee
>  wrote:
> > Another interesting bit about these small tables is that the largest used
> > offset for these tables never went beyond 291 which is the value of
> > MaxHeapTuplesPerPage. I don't know if there is something that prevents
> > inserting more than  MaxHeapTuplesPerPage offsets per heap page and I
> don't
> > know at this point if this gives us upper limit for bits per page (may
> be it
> > does).
>
> From PageAddItemExtended:
>
> /* Reject placing items beyond heap boundary, if heap */
> if ((flags & PAI_IS_HEAP) != 0 && offsetNumber > MaxHeapTuplesPerPage)
> {
> elog(WARNING, "can't put more than MaxHeapTuplesPerPage items
> in a heap page");
> return InvalidOffsetNumber;
> }
>
> Also see the comment where MaxHeapTuplesPerPage is defined:
>
>  * Note: with HOT, there could theoretically be more line pointers (not
> actual
>  * tuples) than this on a heap page.  However we constrain the number of
> line
>  * pointers to this anyway, to avoid excessive line-pointer bloat and not
>  * require increases in the size of work arrays.
>
>
Ah, thanks. So MaxHeapTuplesPerPage sets the upper bound for the per-page
bitmap size. That's about 36 bytes for an 8K page. IOW, if on average there
are 6 or more dead tuples per page, the bitmap will outperform the current
representation, assuming maximum allocation for the bitmap. If we can use
additional estimates to restrict the size to a somewhat more conservative
value and then keep an overflow area, then the break-even probably happens
even earlier than that. I hope this gives us a good starting point, but let
me know if you think it's still a wrong approach to pursue.
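A sketch of that break-even calculation, assuming the bitmap is always
provisioned at the full MaxHeapTuplesPerPage bits per page (rounding up to
whole bytes gives 37 rather than 36):

```python
# Break-even point between a full-width per-page bitmap and the 6-byte TID
# array, assuming the bitmap is provisioned at MaxHeapTuplesPerPage bits.

import math

MAX_HEAP_TUPLES_PER_PAGE = 291
bitmap_bytes = math.ceil(MAX_HEAP_TUPLES_PER_PAGE / 8)   # 37 bytes/page

# The TID array spends 6 bytes per dead tuple, so the bitmap wins once a
# page averages more dead tuples than bitmap_bytes / 6:
break_even = bitmap_bytes / 6
print(bitmap_bytes, break_even)   # 37 bytes, ~6.2 dead tuples/page
```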

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-14 Thread Pavan Deolasee
On Wed, Sep 14, 2016 at 8:47 PM, Robert Haas  wrote:

>
>
> I am kind of doubtful about this whole line of investigation because
> we're basically trying pretty hard to fix something that I'm not sure
> is broken.  I do agree that, all other things being equal, the TID
> lookups will probably be faster with a bitmap than with a binary
> search, but maybe not if the table is large and the number of dead
> TIDs is small, because cache efficiency is pretty important.  But even
> if it's always faster, does TID lookup speed even really matter to
> overall VACUUM performance? Claudio's early results suggest that it
> might, but maybe that's just a question of some optimization that
> hasn't been done yet.


Yeah, I wouldn't worry only about the lookup speedup, but if it does speed
things up, that's a bonus. The bitmaps seem to win even on memory
consumption: as theory and experiments both show, at a 10% dead tuple ratio
bitmaps will win handsomely.

 In
> theory we could even start with the list of TIDs and switch to the
> bitmap if the TID list becomes larger than the bitmap would have been,
> but I don't know if it's worth the effort.
>
>
Yes, that works too. Or maybe even better, because we already know the
bitmap size requirements, certainly for the tuples collected so far. We
might need to maintain some more stats to further optimise the
representation, but that seems like unnecessary detailing at this point.


>
> On the other hand, if we switch to the bitmap as the ONLY possible
> representation, we will lose badly when there are scattered updates -
> e.g. 1 deleted tuple every 10 pages.


Sure. I never suggested that. What I'd suggested was to switch back to the
array representation once we realise bitmaps are not going to work. But I
see it's probably better the other way round.



> So it seems like we probably
> want to have both options.  One tricky part is figuring out how we
> switch between them when memory gets tight; we have to avoid bursting
> above our memory limit while making the switch.


Yes, I was thinking about this problem. Some modelling will be necessary to
ensure that we don't go (much) beyond maintenance_work_mem while switching
representations, which probably means we need to do the switch earlier than
strictly necessary.


> For instance, one idea to grow memory usage incrementally would be to
> store dead tuple information separately for each 1GB segment of the
> relation.  So we have an array of dead-tuple-representation objects,
> one for every 1GB of the relation.  If there are no dead tuples in a
> given 1GB segment, then this pointer can just be NULL.  Otherwise, it
> can point to either the bitmap representation (which will take ~4.5MB)
> or it can point to an array of TIDs (which will take 6 bytes/TID).
> That could handle an awfully wide variety of usage patterns
> efficiently; it's basically never worse than what we're doing today,
> and when the dead tuple density is high for any portion of the
> relation it's a lot better.
>
>
Yes, seems like a good idea. Another idea that I had in mind is to use some
sort of indirection map where the bitmap for every block, or a set of
blocks, will either be recorded or not, depending on whether a bit is set
for the range. If the bitmap exists, the indirection map will give out the
offset into the larger bitmap area. Seems similar to what you described.
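A minimal sketch of that indirection scheme. All names here, the 32-page
range size, and the 128 bits/page width are illustrative assumptions:

```python
# Sketch of the indirection-map idea: block ranges with dead tuples get an
# offset into one shared bitmap area; ranges with none cost nothing.

BITS_PER_PAGE = 128                  # provisioned bitmap width (assumed)
RANGE_SIZE = 32                      # pages per block range (assumed)

class IndirectionMap:
    def __init__(self):
        self.offsets = {}            # range number -> offset into bitmap area
        self.bitmap = bytearray()    # shared bitmap area

    def _range_base(self, range_no):
        """Allocate bitmap space for a range on its first dead tuple."""
        if range_no not in self.offsets:
            self.offsets[range_no] = len(self.bitmap)
            self.bitmap.extend(bytearray(RANGE_SIZE * BITS_PER_PAGE // 8))
        return self.offsets[range_no]

    def set_dead(self, page, offset):
        base = self._range_base(page // RANGE_SIZE)
        bit = (page % RANGE_SIZE) * BITS_PER_PAGE + (offset - 1)
        self.bitmap[base + bit // 8] |= 1 << (bit % 8)

    def is_dead(self, page, offset):
        range_no = page // RANGE_SIZE
        if range_no not in self.offsets:
            return False             # range never saw a dead tuple
        base = self.offsets[range_no]
        bit = (page % RANGE_SIZE) * BITS_PER_PAGE + (offset - 1)
        return bool(self.bitmap[base + bit // 8] & (1 << (bit % 8)))
```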

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-14 Thread Pavan Deolasee
On Wed, Sep 14, 2016 at 10:53 PM, Alvaro Herrera 
wrote:

>
>
> One thing not quite clear to me is how do we create the bitmap
> representation starting from the array representation in midflight
> without using twice as much memory transiently.  Are we going to write
> the array to a temp file, free the array memory, then fill the bitmap by
> reading the array from disk?
>

We could do that. Or maybe compress the TID array once half of m_w_m is
consumed, and do this repeatedly with the remaining memory. For example, if
we start with 1GB of memory, we decide to compress at 512MB. Say that
results in a 300MB bitmap. We then continue to accumulate TIDs and do
another round of fold-up when another 350MB is consumed.

I think we should maintain a per-offset count of the number of dead tuples,
to choose the most optimal bitmap size (beyond which offsets go to the
overflow region). We can also track how many blocks or block ranges have at
least one dead tuple, to know whether it's worthwhile to have some sort of
indirection. Together, these can tell us how much compression can be
achieved and allow us to choose the most optimal representation.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Printing bitmap objects in the debugger

2016-09-15 Thread Pavan Deolasee
On Thu, Sep 15, 2016 at 7:38 PM, Robert Haas  wrote:

>
>
> This seems like a very complicated mechanism of substituting for a
> very simple patch.


I have no objection to the patch per se. The point of posting this was just
to share other mechanisms that exist. BTW, an advantage of using debugger
scripts is that they also work while inspecting core dumps.


> Your LLDB script is about the same number of lines
> as Ashutosh's patch and only works for people who use LLDB.


Alvaro pointed out that gdb also have similar capabilities.


>   Plus,
> once you write it, you've got to enter the Python interpreter to use
> it and then run three more lines of code that aren't easy to remember.
>
> In contrast, with Ashutosh's proposed patch, you just write:
>
> p bms_to_char(bms_object)
>
>
I learnt this yesterday, and I am sure there are easier ways to do the same
thing that I just don't know about. For example, you can also do this:

(lldb) script print print_bms_members(lldb.frame.FindVariable ("a"))
nwords = 1 bitmap: 0x200

It's still slightly cumbersome, but better than entering the interpreter.


> ...and you're done.  Now, I grant that his approach bloats the binary
> and yours does not, but nobody complains about pprint() bloating the
> binary.


Sure. I wasn't aware of the existence of pprint() either, and maybe that's
enough from a debugging perspective. When I tried it yesterday, the output
went to the logfile instead of appearing at the debugger prompt. Maybe I
did something wrong, or maybe that's not an inconvenience for those who use
it regularly.

So yeah, no objections to the patch. I was happy to discover what I did and
thought of sharing it, assuming others might find it useful too.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Printing bitmap objects in the debugger

2016-09-15 Thread Pavan Deolasee
On Thu, Sep 15, 2016 at 7:55 PM, Pavan Deolasee 
wrote:

>
> (lldb) script print print_bms_members(lldb.frame.FindVariable ("a"))
> nwords = 1 bitmap: 0x200
>
>
Or even this if lldb.frame.FindVariable() is pushed inside the function:

(lldb) script print print_bms_members('a')
nwords = 1 bitmap: 0x200

Thanks,
Pavan

--
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-15 Thread Pavan Deolasee
On Fri, Sep 16, 2016 at 12:24 AM, Claudio Freire 
wrote:

> On Thu, Sep 15, 2016 at 3:48 PM, Tomas Vondra
>  wrote:
> > For example, we always allocate the TID array as large as we can fit into
> > m_w_m, but maybe we don't need to wait with switching to the bitmap until
> > filling the whole array - we could wait as long as the bitmap fits into
> the
> > remaining part of the array, build it there and then copy it to the
> > beginning (and use the bitmap from that point).
>
> The bitmap can be created like that, but grow from the end of the
> segment backwards.
>
> So the scan can proceed until the bitmap fills the whole segment
> (filling backwards), no copy required.
>

I thought about those approaches when I suggested starting with half of
m_w_m. So you fill in TIDs from one end and, upon reaching the halfway
point, convert them into a bitmap (assuming the stats tell you that there
are significant savings with bitmaps), but fill it from the other end. Then
reset the TID array and start filling it up again. That guarantees that you
can always work within the available limit.

But I actually wonder if we are over-engineering things and overestimating
the cost of memmove etc. How about this simpler approach:

1. Divide the table into fixed-size chunks, as Robert suggested. Say 1GB
each.
2. Allocate a pointer array to store a pointer to each segment. For a 1TB
table, that's about 8192 bytes.
3. Allocate a bitmap which can hold MaxHeapTuplesPerPage * chunk size in
pages. For an 8192-byte block and a 1GB chunk, that's about 4.6MB. *Note*:
I'm suggesting a bitmap here because, provisioning for the worst case, a
fixed-size TID array would cost us 200MB+, whereas the bitmap is just 4.6MB.
4. Collect dead tuples in that 1GB chunk. Also collect stats so that we
know the most optimal representation.
5. At the end of the 1GB scan, if no dead tuples were found, set the chunk
pointer to NULL, move to the next chunk and restart step 4. If dead tuples
were found, then check:
   a. Whether the bitmap can be further compressed by using fewer bits per
page. If so, allocate a new bitmap and compress into it.
   b. Whether a TID array would be a more compact representation. If so,
allocate a TID array of the right size and convert the bitmap into an array.
   c. Set the chunk pointer to whichever representation we choose (of
course, add headers etc. to interpret the representation).
6. Continue until we consume all of m_w_m or the end of the table is
reached. If we consume all of m_w_m, do a round of index cleanup and
restart.

I also realised that we can compact the TID array in step (b) above because
we only need to store 17 bits for block numbers (we already know which 1GB
segment they belong to). Given that usable offsets also need just 13 bits,
the TID array needs only 4 bytes per TID instead of 6.

Many people are working on implementing different ideas, and I can
volunteer to write a patch along these lines unless someone else wants to
do it.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-15 Thread Pavan Deolasee
On Fri, Sep 16, 2016 at 9:09 AM, Pavan Deolasee 
wrote:

>
> I also realised that we can compact the TID array in step (b) above
> because we only need to store 17 bits for block numbers (we already know
> which 1GB segment they belong to). Given that usable offsets are also just
> 13 bits, TID array needs only 4 bytes per TID instead of 6.
>
>
Actually, this seems like a clear saving of at least 30% for all use cases,
at the cost of allocating in smaller chunks and doing some transformations.
But given the problem we are trying to solve, that seems like a small price
to pay.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-16 Thread Pavan Deolasee
On Fri, Sep 16, 2016 at 7:03 PM, Robert Haas  wrote:

> On Thu, Sep 15, 2016 at 11:39 PM, Pavan Deolasee
>  wrote:
> > But I actually wonder if we are over engineering things and
> overestimating
> > cost of memmove etc. How about this simpler approach:
>
> Don't forget that you need to handle the case where
> maintenance_work_mem is quite small.
>
>
How small? The default, IIRC, is 64MB these days and the minimum is 1MB. I
think we can do some special-casing for very small values and ensure that
things at the very least work, and hopefully don't regress, for them.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


[HACKERS] Use of SizeOfIptrData - is that obsolete?

2016-09-20 Thread Pavan Deolasee
Hi,

I happened to notice this comment in src/include/storage/itemptr.h

 * Note: because there is an item pointer in each tuple header and index
 * tuple header on disk, it's very important not to waste space with
 * structure padding bytes.  The struct is designed to be six bytes long
 * (it contains three int16 fields) but a few compilers will pad it to
 * eight bytes unless coerced.  We apply appropriate persuasion where
 * possible, and to cope with unpersuadable compilers, we try to use
 * "SizeOfIptrData" rather than "sizeof(ItemPointerData)" when computing
 * on-disk sizes.
 */

Is that now obsolete? I can still find one reference in
src/backend/executor/nodeTidscan.c, which uses SizeOfIptrData instead of
sizeof(ItemPointerData), but all other places, such as heapam_xlog.h, now
use the sizeof operator. This was changed in commit 2c03216d831160b when we
moved a bunch of xlog-related stuff from htup.h to this new file. It's hard
to tell whether there were other users before that and they were all
dropped in this one commit, or in various commits leading up to it.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Use of SizeOfIptrData - is that obsolete?

2016-09-20 Thread Pavan Deolasee
On Tue, Sep 20, 2016 at 8:34 PM, Tom Lane  wrote:

> Pavan Deolasee  writes:
> > I happened to notice this comment in src/include/storage/itemptr.h
>
> >  * Note: because there is an item pointer in each tuple header and index
> >  * tuple header on disk, it's very important not to waste space with
> >  * structure padding bytes.  The struct is designed to be six bytes long
> >  * (it contains three int16 fields) but a few compilers will pad it to
> >  * eight bytes unless coerced.  We apply appropriate persuasion where
> >  * possible, and to cope with unpersuadable compilers, we try to use
> >  * "SizeOfIptrData" rather than "sizeof(ItemPointerData)" when computing
> >  * on-disk sizes.
> >  */
>
> > Is that now obsolete?
>
> Realistically, because struct HeapTupleHeaderData contains a field of
> type ItemPointerData, it's probably silly to imagine that we can save
> anything if the compiler can't be persuaded to believe that
> sizeof(ItemPointerData) is 6.  It may well be that the structure pragmas
> work on everything that wouldn't natively believe that anyway.
>

Yeah, that's what I thought, and the rest of the code seems to make that
assumption as well. The attached patch removes the last remaining reference
to SizeOfIptrData and also removes the macro and the associated comment.
TBH I couldn't fully trace how the TID array gets generated in
nodeTidscan.c, but I'm fairly confident this won't break anything because
it was the only remaining reference.

While the htup.h refactoring happened in 9.5, I don't see any point in
back-patching this.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


pg_remove_itemptr_size.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2016-09-23 Thread Pavan Deolasee
On Fri, Sep 23, 2016 at 6:05 PM, Tomas Vondra 
wrote:

> On 09/23/2016 05:10 AM, Amit Kapila wrote:
>
>> On Fri, Sep 23, 2016 at 5:14 AM, Tomas Vondra
>>  wrote:
>>
>>> On 09/21/2016 08:04 AM, Amit Kapila wrote:
>>>
>>>>
>>>>
>>> (c) Although it's not visible in the results, 4.5.5 almost perfectly
>>> eliminated the fluctuations in the results. For example when 3.2.80
>>> produced
>>> this results (10 runs with the same parameters):
>>>
>>> 12118 11610 27939 11771 18065
>>> 12152 14375 10983 13614 11077
>>>
>>> we get this on 4.5.5
>>>
>>> 37354 37650 37371 37190 37233
>>> 38498 37166 36862 37928 38509
>>>
>>> Notice how much more even the 4.5.5 results are, compared to 3.2.80.
>>>
>>>
>> how long each run was?  Generally, I do half-hour run to get stable
>> results.
>>
>>
> 10 x 5-minute runs for each client count. The full shell script driving
> the benchmark is here: http://bit.ly/2doY6ID and in short it looks like
> this:
>
> for r in `seq 1 $runs`; do
> for c in 1 8 16 32 64 128 192; do
> psql -c checkpoint
> pgbench -j 8 -c $c ...
> done
> done



I see a couple of problems with the tests:

1. You're running regular pgbench, which also updates the small tables. At
scale 300 and higher client counts, there is going to be heavy contention
on the pgbench_branches table. Why not test with pgbench -N? As far as this
patch is concerned, we are only interested in seeing contention on
ClogControlLock. In fact, how about a test which only consumes an XID but
does no write activity at all? A completely artificial workload, but good
enough to tell us if and how much the patch helps in the best case. We can
probably do that with a simple txid_current() call, right?

2. Each subsequent pgbench run will bloat the tables. That may not be such
a big deal given that you're checkpointing between runs, but it still makes
results somewhat hard to compare. If a vacuum kicks in, that may have some
impact too. Given the scale factor you're testing, why not just start fresh
every time?

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Use of SizeOfIptrData - is that obsolete?

2016-09-25 Thread Pavan Deolasee
On Fri, Sep 23, 2016 at 12:04 AM, Tom Lane  wrote:

>
> I thought removing the comment altogether was not appropriate, because
> it remains true that you want to work really hard to ensure that
> sizeof(ItemPointerData) is 6.  We're just giving up on pretense of support
> for compilers that don't believe that


Makes sense.


> .  I'm half tempted to introduce a
> StaticAssert about it, but refrained for the moment.
>
>
I also thought about that and it probably makes sense, at least to see how
the buildfarm behaves. One reason to do so is that I did not find any
discussion or evidence of why the SizeOfIptrData magic is no longer
necessary, and a StaticAssert would tell us whether it was an unintentional
change at some point.


> > While htup.h refactoring happened in 9.5, I don't see any point in back
> > patching this.
>
> Agreed.  Pushed to HEAD only.
>

Thanks.


Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Refactoring of heapam code.

2016-09-25 Thread Pavan Deolasee
On Tue, Sep 6, 2016 at 8:39 PM, Anastasia Lubennikova <
a.lubennik...@postgrespro.ru> wrote:

> 06.09.2016 07:44, Pavan Deolasee:
>
>
> I don't know what to do with the CF entry itself. I could change the
> status to "waiting on author" but then the design draft may not get enough
> attention. So I'm leaving it in the current state for others to look at.
>
>
> I'll try to update patches as soon as possible. Little code cleanup will
> be useful
> regardless of final refactoring design.
>
>
I'm marking the patch as "Returned with Feedback". I assume you'll submit a
fresh set of patches for the next CF anyway.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2016-10-05 Thread Pavan Deolasee
On Wed, Oct 5, 2016 at 1:43 PM, Tomas Vondra 
wrote:

>
>
> I've been looking at the patch over the past few days, running a bunch of
> benchmarks etc.


Thanks for doing that.


> I can confirm the significant speedup, often by more than 75% (depending
> on number of indexes, whether the data set fits into RAM, etc.). Similarly
> for the amount of WAL generated, although that's a bit more difficult to
> evaluate due to full_page_writes.
>
> I'm not going to send detailed results, as that probably does not make
> much sense at this stage of the development - I can repeat the tests once
> the open questions get resolved.
>

Sure. Anything that stands out? Any regressions? I'm not sure whether your
benchmarks exercise the paths which might show overhead without any
tangible benefit. For example, a test with many indexes where most of them
get updated, followed by querying the table via those updated indexes,
could be one such case.


>
> There's a lot of useful and important feedback in the thread(s) so far,
> particularly the descriptions of various failure cases. I think it'd be
> very useful to collect those examples and turn them into regression tests -
> that's something the patch should include anyway.
>

Sure. I added only a handful of test cases which I knew the regression
suite wasn't covering, but I'll write more of them. One good thing is that
the code gets heavily exercised even during the regression tests; I caught
and fixed multiple bugs that way. I'm not saying that's enough, but it
certainly gives some confidence.


>
> and update:
>
> update t set a = a+1, b=b+1;
>
> which has to update all indexes on the table, but:
>
> select n_tup_upd, n_tup_hot_upd from pg_stat_user_tables
>
>  n_tup_upd | n_tup_hot_upd
> ---+---
>   1000 |  1000
>
> So it's still counted as "WARM" - does it make sense?


No, it does not. The code currently just marks any update as a WARM update
if the table supports it and there is enough free space in the page. And
yes, you're right: it's worth fixing that because of the
one-WARM-update-per-chain limitation. Will fix.


>
> The way this is piggy-backed on the current HOT statistics seems a bit
> strange for another reason,


Agree. We could add a similar n_tup_warm_upd counter.


> But WARM changes that - it allows adding index entries only to a subset of
> indexes, which means the "per row" n_tup_hot_upd counter is not sufficient.
> When you have a table with 10 indexes, and the counter increases by 1, does
> that mean the update added index tuple to 1 index or 9 of them?
>

How about having counters similar to n_tup_ins/n_tup_del for indexes as
well? Today it does not make sense because every index gets the same number
of inserts, but WARM will change that.

For example, we could have idx_tup_insert and idx_tup_delete that show up
in pg_stat_user_indexes. I don't know if idx_tup_delete adds much value, but
one can then look at idx_tup_insert for various indexes to get a sense of
which indexes receive more inserts than others. The indexes which receive
more inserts are the ones being updated more frequently compared to other
indexes.

This also relates to vacuuming strategies. Today HOT updates do not count
for triggering vacuum (or to be more precise, HOT pruned tuples are
discounted while counting dead tuples). WARM tuples get the same treatment
as far as pruning is concerned, but since they cause fresh index inserts, I
wonder if we need some mechanism to cleanup the dead line pointers and dead
index entries. This will become more important if we do something to
convert WARM chains into HOT chains, something that only VACUUM can do in
the design I've proposed so far.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


[HACKERS] FSM corruption leading to errors

2016-10-06 Thread Pavan Deolasee


-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


pg_fsm_corruption_fix.patch
Description: Binary data


pg_trigger_fsm_error.sh
Description: Bourne shell script



Re: [HACKERS] FSM corruption leading to errors

2016-10-10 Thread Pavan Deolasee
On Mon, Oct 10, 2016 at 7:55 PM, Michael Paquier 
wrote:

>
>
> +   /*
> +* See comments in GetPageWithFreeSpace about handling outside the
> valid
> +* range blocks
> +*/
> +   nblocks = RelationGetNumberOfBlocks(rel);
> +   while (target_block >= nblocks && target_block != InvalidBlockNumber)
> +   {
> +   target_block = RecordAndGetPageWithFreeSpace(rel, target_block, 0,
> +   spaceNeeded);
> +   }
> Hm. This is just a workaround. Even if things are done this way the
> FSM will remain corrupted.


No, because the code above updates the FSM for those out-of-range blocks.
But now that I look at it again, maybe this is not correct: it may get into
an endless loop if the relation is repeatedly extended concurrently.


> And isn't that going to break once the
> relation is extended again?


Once the underlying bug is fixed, I don't see why it should break again. I
added the above code mostly to deal with already-corrupt FSMs. Maybe we can
just document this and leave it to the user to run some correctness checks
(see below), especially given that the code is not cheap and adds overhead
for everybody, irrespective of whether they have, or will ever have, a
corrupt FSM.


> I'd suggest instead putting in the release
> notes a query that allows one to analyze what are the relations broken
> and directly have them fixed. That's annoying, but it would be really
> better than a workaround. One idea here is to use pg_freespace() and
> see if it returns a non-zero value for an out-of-range block on a
> standby.
>
>
Right, that's how I tested for broken FSMs. A challenge with any such query
is that if the shared-buffer copy of the FSM page is intact, the query
won't report the problematic FSMs. Of course, if the fix is applied to the
standby and the standby is restarted, then corrupt FSMs can be detected.


>
> At the same time, I have translated your script into a TAP test, I
> found that more useful when testing..
>

Thanks for doing that.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] FSM corruption leading to errors

2016-10-17 Thread Pavan Deolasee
On Tue, Oct 11, 2016 at 5:20 AM, Michael Paquier 
wrote:

>
> >
> > Once the underlying bug is fixed, I don't see why it should break again.
> I
> > added the above code to mostly deal with already corrupt FSMs. May be we
> can
> > just document and leave it to the user to run some correctness checks
> (see
> > below), especially given that the code is not cheap and adds overheads
> for
> > everybody, irrespective of whether they have or will ever have corrupt
> FSM.
>
> Yep. I'd leave it for the release notes to hold a diagnostic method.
> That's annoying, but this has been done in the past like for the
> multixact issues..
>

I'm okay with that. It's annoying, especially because the bug may show up
when your primary is down and you have just failed over for HA, only to
find that the standby won't work correctly. But I don't have any ideas for
fixing existing corruption without adding a significant penalty to the
normal path.


>
> What if you restart the standby, and then do a diagnostic query?
> Wouldn't that be enough? (Something just based on
> pg_freespace(pg_relation_size(oid) / block_size) != 0)
>
>
Yes, that will be enough once the fix is in place.

I think this is a major bug and I would appreciate any ideas to get the
patch in a committable shape before the next minor release goes out. We
probably need a committer to get interested in this to make progress.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] FSM corruption leading to errors

2016-10-17 Thread Pavan Deolasee
On Mon, Oct 17, 2016 at 4:34 PM, Heikki Linnakangas  wrote:

>
>>
> visibilitymap_truncate is actually also wrong, in a different way. The
> truncation WAL record is written only after the VM (and FSM) are truncated.
> But visibilitymap_truncate() has already modified and dirtied the page. If
> the VM page change is flushed to disk before the WAL record, and you crash,
> you might have a torn VM page and a checksum failure.
>
>
Right, I missed the problem with checksums.


> Simply replacing the MarkBufferDirtyHint() call with MarkBufferDirty() in
> FreeSpaceMapTruncateRel would have the same issue. If you call
> MarkBufferDirty(), you must WAL-log the change, and also set the page's LSN
> to make sure the WAL record is flushed first.
>
>
I'm not sure I fully understand this. If the page is flushed before the WAL
record, we might end up with a truncated FSM whereas the relation
truncation is not replayed. But that's not the same problem, right? I mean,
you might have an FSM which is not accurate, but at least it won't return
blocks outside the valid range. Having said that, I agree your proposed
changes are more robust.

BTW any thoughts on race-condition on the primary? Comments at
MarkBufferDirtyHint() seems to suggest that a race condition is possible
which might leave the buffer without the DIRTY flag, but I'm not sure if
that can only happen when the page is locked in shared mode. While most of
the reports so far involved standbys, and the bug can also hit a standalone
master during crash recovery, I wonder if a race can occur even on a live
system.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] FSM corruption leading to errors

2016-10-19 Thread Pavan Deolasee
On Wed, Oct 19, 2016 at 2:37 PM, Heikki Linnakangas  wrote:

>
>>
> Actually, this is still not 100% safe. Flushing the WAL before modifying
> the FSM page is not enough. We also need to WAL-log a full-page image of
> the FSM page, otherwise we are still vulnerable to the torn page problem.
>
> I came up with the attached. This is fortunately much simpler than my
> previous attempt. I replaced the MarkBufferDirtyHint() calls with
> MarkBufferDirty(), to fix the original issue, plus WAL-logging a full-page
> image to fix the torn page issue.
>
>
Looks good to me.


> BTW any thoughts on race-condition on the primary? Comments at
>> MarkBufferDirtyHint() seems to suggest that a race condition is possible
>> which might leave the buffer without the DIRTY flag, but I'm not sure if
>> that can only happen when the page is locked in shared mode.
>>
>
> I think the race condition can only happen when the page is locked in
> shared mode. In any case, with this proposed fix, we'll use
> MarkBufferDirty() rather than MarkBufferDirtyHint(), so it's moot.
>
>
Yes, the fix will cover that problem (if it exists). The reason I was
curious is that there are several reports of similar errors in the past,
some of which did not involve a standby. Those reports mostly remained
unresolved and I wondered if this explains them. But yeah, my conclusion
was that the race is not possible with the page locked in EXCLUSIVE mode.
So maybe there is another problem somewhere, or crash recovery may have
left the FSM in an inconsistent state.

Anyways, we seem good to go with the patch.

Thanks,
Pavan
-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] FSM corruption leading to errors

2016-10-19 Thread Pavan Deolasee
On Wed, Oct 19, 2016 at 6:44 PM, Heikki Linnakangas  wrote:

> On 10/19/2016 02:32 PM, Heikki Linnakangas wrote:
>
>>
>>>
>> Oh, forgot that this needs to be backported, of course. Will do that
>> shortly...
>>
>
> Done.
>

Thanks!


>
> This didn't include anything to cope with an already-corrupt FSM, BTW. Do
> we still want to try something for that? I think it's good enough if we
> prevent the FSM corruption from happening, but not sure what the consensus
> on that might be..
>
>
I thought it would be nice to handle an already-corrupt FSM, since our
customer found one immediately after a failover and it became a serious
issue. In one case, a system table was affected, preventing all DDLs from
running. Having said that, I don't have a better idea for handling the
problem without causing non-trivial overhead in normal cases (see my
original patch). If you have better ideas, they might be worth pursuing.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Indirect indexes

2016-10-19 Thread Pavan Deolasee
On Wed, Oct 19, 2016 at 7:19 PM, Alexander Korotkov <
a.korot...@postgrespro.ru> wrote:

> On Wed, Oct 19, 2016 at 3:52 PM, Robert Haas 
> wrote:
>
>> The VACUUM problems seem fairly serious.  It's true that these indexes
>> will be less subject to bloat, because they only need updating when
>> the PK or the indexed columns change, not when other indexed columns
>> change.  On the other hand, there's nothing to prevent a PK from being
>> recycled for an unrelated tuple.  We can guarantee that a TID won't be
>> recycled until all index references to the TID are gone, but there's
>> no such guarantee for a PK.  AFAICT, that would mean that an indirect
>> index would have to be viewed as unreliable: after looking up the PK,
>> you'd *always* have to recheck that it actually matched the index
>> qual.
>>
>
> AFAICS, even without considering VACUUM, indirect indexes would be always
> used with recheck.
> As long as they don't contain visibility information.  When indirect
> indexed column was updated, indirect index would refer same PK with
> different index keys.
> There is no direct link between indirect index tuple and heap tuple, only
> logical link using PK.  Thus, you would anyway have to recheck.
>
>
I agree. Also, I think the recheck mechanism will have to be something like
what I wrote for WARM, i.e. checking the index quals alone won't be enough;
we would actually need to verify that the heap tuple satisfies the key in
the indirect index.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] FSM corruption leading to errors

2016-10-19 Thread Pavan Deolasee
On Wed, Oct 19, 2016 at 6:54 PM, Tom Lane  wrote:

>
>
> Can we document an existing procedure for repairing FSM corruption?
> (VACUUM, maybe?)


I'm afraid it may not be easy to repair the corruption with existing
facilities. Most often the corruption will be on the standby, and a VACUUM
may not actually touch the affected pages on the master (because they may
not even exist there, or may be skipped because of the visibility map). It
may not even trigger relation truncation. So AFAICS it may not generate any
WAL activity that can fix the corruption on the standby.

One possible way would be to delete the FSM (and VM) information on the
master and standby and then run VACUUM so these maps are rebuilt. We
obviously don't need to do this for all tables, but we need a way to find
the tables with corrupt FSM [1].

Suggested procedure could be:

1. Upgrade master and standby to the latest minor release (which involves a
restart).
2. Install the pg_freespace extension and run the query [1] on the master
to find possible corruption. The query checks whether the FSM reports free
space in a block beyond the end of the relation. Unfortunately, we might
get false positives if the relation is extended while the query is running.
3. Repeat the same query on the standby (if it's running in hot standby
mode; otherwise the corruption can only be detected once it's promoted to
master).
4. Remove the FSM and VM files for the affected tables (I don't think it's
safe to do this on a running server).
5. VACUUM the affected tables so that the FSM and VM are rebuilt.

Another idea is to implement a pg_freespace_repair() function in
pg_freespace which takes an AccessExclusiveLock on the table and truncates
the FSM to the table's current size, thus generating a WAL record that the
standby will replay to fix the corruption. This looks more promising: easy
to explain and less error prone.


[1] SELECT *
      FROM (
        SELECT oid::regclass AS relname,
               EXISTS (
                 SELECT *
                   FROM (
                     SELECT blkno, pg_freespace(oid::regclass, blkno)
                       FROM generate_series(
                              pg_relation_size(oid::regclass) /
                                current_setting('block_size')::bigint,
                              pg_relation_size(oid::regclass, 'fsm') / 2
                            ) AS blkno
                   ) AS avail
                  WHERE pg_freespace > 0
               ) AS corrupt_fsm
          FROM pg_class
         WHERE relkind = 'r'
      ) b
     WHERE b.corrupt_fsm = true;


Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] FSM corruption leading to errors

2016-10-19 Thread Pavan Deolasee
On Thu, Oct 20, 2016 at 10:50 AM, Michael Paquier  wrote:

>
>  For VMs a good way would
> be to use pg_visibility's pg_truncate_visibility_map(), but only for
> 9.6~.


Ah ok..


> For FSM there is no real solution, and actually a
> pg_truncate_fsm would prove to be useful here.


Right, that's what I proposed as an alternate idea. I agree this is much
cleaner and less error prone approach.

Actually, if we could add an API which truncates the FSM to the given heap
block, the user may not even need to run VACUUM, which could be costly for
very large tables. Also, AFAICS we would need to backport
pg_truncate_visibility_map() to older releases, because unless the VM is
truncated along with the FSM, VACUUM may not scan all pages, and the free
space for those pages won't be recorded in the FSM.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] FSM corruption leading to errors

2016-10-19 Thread Pavan Deolasee
On Thu, Oct 20, 2016 at 11:34 AM, Michael Paquier  wrote:

> On Thu, Oct 20, 2016 at 2:50 PM, Pavan Deolasee
>  wrote:
> > Actually, if we could add an API which can truncate FSM to the given heap
> > block, then the user may not even need to run VACUUM, which could be
> costly
> > for very large tables.
>
> FreeSpaceMapTruncateRel()?
>

Right. We need a SQL callable function to invoke that.


>
> > Also, AFAICS we will need to backport
> > pg_truncate_visibility_map() to older releases because unless the VM is
> > truncated along with the FSM, VACUUM may not scan all pages and the FSM
> for
> > those pages won't be recorded.
>
> This would need a careful lookup, and it would not be that complicated
> to implement. And this bug justifies the presence of a tool like that
> actually... But usually those don't get a backpatch, so the
> probability of getting that backported is low IMO, personally I am not
> sure I like that either.
>

Just to clarify: I meant that if we truncate the entire FSM, then we'll
also need an API to truncate the VM so that VACUUM rebuilds everything
completely. OTOH if we provide a function that just truncates the FSM to
match the size of the table, then we don't need to rebuild the FSM. So
that's probably a better way to handle FSM corruption, as far as this
particular issue is concerned.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Indirect indexes

2016-10-20 Thread Pavan Deolasee
On Thu, Oct 20, 2016 at 8:44 PM, Petr Jelinek  wrote:

>
>
> WARM can do WARM update 50% of time, indirect index can do HOT update
> 100% of time (provided the column is not changed), I don't see why we
> could not have both solutions.
>
>
I think the reason why I restricted WARM to one update per chain also
applies to indirect indexes. For example, if an indirectly indexed column's
value is changed from 'a' to 'b' and back to 'a', there will be two
pointers from 'a' to the PK, and AFAICS that would lead to the same
duplicate-scan issue.

We have a design to convert WARM chains back to HOT chains, which will
increase the percentage of WARM updates well beyond 50%. I was waiting for
feedback on the basic patch before putting in more effort, but it went
unnoticed last CF.


> That all being said, it would be interesting to hear Álvaro's thoughts
> about which use-cases he expects indirect indexes to work better than WARM.
>
>
Yes, will be interesting to see that comparison. May be we need both or may
be just one. Even better may be they complement each other.. I'll also put
in some thoughts in this area.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Indirect indexes

2016-10-20 Thread Pavan Deolasee
On Thu, Oct 20, 2016 at 9:20 PM, Claudio Freire 
wrote:

>
>
> With indirect indexes, since you don't need to insert a tid, you can
> just "insert on conflict do nothing" on the index.
>

Would that work with non-unique indexes? Anyway, the point I was trying to
make is that there are similar technical challenges, and we could solve
them for WARM as well using your work for finding conflicting pairs and
then skipping the inserts. My current thinking is that it will lead to
other challenges, especially around vacuum, but I could be wrong.

What I tried to do with the initial WARM patch is to show a significant
improvement even with just 50% WARM updates, yet keep the patch simple. But
there are of course several things we can do to improve it further and
support other index types.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] FSM corruption leading to errors

2016-10-24 Thread Pavan Deolasee
On Mon, Oct 24, 2016 at 9:34 PM, Tom Lane  wrote:

>
> SELECT blkno, pg_freespace(oid::regclass, blkno)
> FROM generate_series(pg_relation_size(oid::regclass) /
> current_setting('block_size')::BIGINT,
>  pg_relation_size(oid::regclass, 'fsm') / 2) AS
> blkno
>
> It looks to me like this is approximating the highest block number that
> could possibly have an FSM entry as size of the FSM fork (in bytes)
> divided by 2.  But the FSM stores one byte per block.  There is overhead
> for the FSM search tree, but in a large relation it's not going to be as
> much as a factor of 2.  So I think that to be conservative we need to
> drop the "/ 2".  Am I missing something?
>
>
I went by these comments in fsm_internals.h, which suggest that
SlotsPerFSMPage is limited to somewhat less than BLCKSZ divided by 2:

/*
 * Number of non-leaf and leaf nodes, and nodes in total, on an FSM page.
 * These definitions are internal to fsmpage.c.
 */
#define NodesPerPage (BLCKSZ - MAXALIGN(SizeOfPageHeaderData) - \
  offsetof(FSMPageData, fp_nodes))

#define NonLeafNodesPerPage (BLCKSZ / 2 - 1)
#define LeafNodesPerPage (NodesPerPage - NonLeafNodesPerPage)

/*
 * Number of FSM "slots" on a FSM page. This is what should be used
 * outside fsmpage.c.
 */
#define SlotsPerFSMPage LeafNodesPerPage


Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] FSM corruption leading to errors

2016-10-24 Thread Pavan Deolasee
On Mon, Oct 24, 2016 at 9:47 PM, Tom Lane  wrote:

>
>
> Also, we could at least discount the FSM root page and first intermediate
> page, no?  That is, the upper limit could be
>
> pg_relation_size(oid::regclass, 'fsm') / 2 -
> 2*current_setting('block_size')::BIGINT
>
>
+1 for doing that.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-19 Thread Pavan Deolasee
On Sun, Feb 19, 2017 at 3:43 PM, Robert Haas  wrote:

> On Fri, Feb 17, 2017 at 11:15 PM, Tom Lane  wrote:
>
> > Ah, nah, scratch that.  If any post-index-build pruning had occurred on a
> > page, the evidence would be gone --- the non-matching older tuples would
> > be removed and what remained would look consistent.  But it wouldn't
> > match the index.  You might be able to find problems if you were willing
> > to do the expensive check on *every* HOT chain, but that seems none too
> > practical.
>
> If the goal is just to detect tuples that aren't indexed and should
> be, what about scanning each index and building a TIDBitmap of all of
> the index entries, and then scanning the heap for non-HOT tuples and
> probing the TIDBitmap for each one?  If you find a heap entry for
> which you didn't find an index entry, go and recheck the index and see
> if one got added concurrently.  If not, whine.
>
>
This particular case of corruption results in a heap tuple being indexed by
the wrong key (or, to be precise, by its old value). So the only way to
detect the corruption is to look at each index key and check whether it
matches the corresponding heap tuple. We could write some kind of self-join
that uses a sequential scan and an index-only scan (David Rowley had
suggested something of that sort internally), but we can't guarantee an
index-only scan on a table which is being concurrently updated. So I'm not
sure even that will catch every possible case.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-02-21 Thread Pavan Deolasee
Hi Tom,

On Wed, Feb 1, 2017 at 3:51 AM, Tom Lane  wrote:

>
> (I'm a little more concerned by Alvaro's apparent position that WARM
> is a done deal; I didn't think so.


Are there any specific aspects of the design that you're not comfortable
with? I'm sure there could be some rough edges in the implementation that
I'm hoping will get handled during the further review process. But if there
are some obvious things I'm overlooking, please let me know.

Probably the same question goes to Andres and Robert, who have flagged
concerns. On my side, I've run some very long tests with data validation
and haven't found any new issues with the most recent patches.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


[HACKERS] Cleanup: avoid direct use of ip_posid/ip_blkid

2017-02-22 Thread Pavan Deolasee
Hello All,

I would like to propose the attached patch, which removes all direct
references to the ip_posid and ip_blkid members of the ItemPointerData
struct and instead uses the ItemPointerGetOffsetNumber and
ItemPointerGetBlockNumber macros respectively to access these members.

My motivation to do this is to try to free up some bits from the ip_posid
field, given that it needs at most 13 bits. But even without that, it seems
like a good cleanup to me.

One reason why these macros are not always used is that they typically
do assert-validation to ensure ip_posid has a valid value. There are a few
places in the code, especially in GIN, and also when we are dealing with
user-supplied TIDs, where we might get a TID with an invalid ip_posid. I've
handled that by defining and using separate macros which skip the
validation. This doesn't seem any worse than what we are already doing.
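
For illustration, the difference between the validating and non-validating
accessors could look roughly like this (the stand-in types and the "NoCheck"
names are my assumptions for the sketch, not necessarily what the patch uses):

```c
#include <assert.h>
#include <stdint.h>

/* Minimal stand-ins for PostgreSQL's types, for illustration only */
typedef uint16_t OffsetNumber;
typedef struct
{
    uint32_t     ip_blkid;      /* simplified: the real struct uses BlockIdData */
    OffsetNumber ip_posid;
} ItemPointerData;

/* The standard accessor asserts that the offset is valid, which is why
 * some call sites avoid it today */
#define ItemPointerGetOffsetNumber(p) \
    (assert((p)->ip_posid != 0), (p)->ip_posid)

/* Hypothetical "NoCheck" variant that skips the validation, for callers
 * (e.g. GIN, user-supplied TIDs) that may legitimately see ip_posid == 0 */
#define ItemPointerGetOffsetNumberNoCheck(p) \
    ((p)->ip_posid)
```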

make check-world with --enable-tap-tests passes.

Comments/suggestions?

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


remove_ip_posid_blkid_ref_v3.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-02-24 Thread Pavan Deolasee
On Thu, Feb 23, 2017 at 11:30 PM, Bruce Momjian  wrote:

> On Wed, Feb  1, 2017 at 10:46:45AM +0530, Pavan Deolasee wrote:
> > > contains a WARM tuple. Alternate ideas/suggestions and review of
> the
> > design
> > > are welcome!
> >
> > t_infomask2 contains one last unused bit,
> >
> >
> > Umm, WARM is using 2 unused bits from t_infomask2. You mean there is
> another
> > free bit after that too?
>
> We are obviously going to use several heap or item pointer bits for
> WARM, and once we do that it is going to be hard to undo that.  Pavan,
> are you saying you could do more with WARM if you had more bits?  Are we
> sure we have given you all the bits we can?  Do we want to commit to a
> lesser feature because the bits are not available?
>
>
The btree implementation is as complete as I would like (there are a few
TODOs, but no show-stoppers), at least for the first release. There is a
free bit in the btree index tuple header that I could use for chain
conversion. In heap tuples, I can reuse HEAP_MOVED_OFF because that bit
will only be set along with the HEAP_WARM_TUPLE bit. Since no upgraded
cluster can have the HEAP_WARM_TUPLE bit set, I think we are safe.

WARM currently also supports hash indexes, but there is no free bit left in
the hash index tuple header. I think I can work around that by using a bit
from ip_posid (not yet implemented/tested, but it seems doable).

IMHO if we can do that i.e. support btree and hash indexes to start with,
we should be good to go for the first release. We can try to support other
popular index AMs in the subsequent release.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-02-24 Thread Pavan Deolasee
On Thu, Feb 23, 2017 at 11:53 PM, Bruce Momjian  wrote:

> On Thu, Feb 23, 2017 at 03:03:39PM -0300, Alvaro Herrera wrote:
> > Bruce Momjian wrote:
> >
> > > As I remember, WARM only allows
> > > a single index-column change in the chain.  Why are you seeing such a
> > > large performance improvement?  I would have thought it would be that
> > > high if we allowed an unlimited number of index changes in the chain.
> >
> > The second update in a chain creates another non-warm-updated tuple, so
> > the third update can be a warm update again, and so on.
>
> Right, before this patch they would be two independent HOT chains.  It
> still seems like an unexpectedly-high performance win.  Are two
> independent HOT chains that much more expensive than joining them via
> WARM?


In these tests, there are zero HOT updates, since every update modifies
some index column. With WARM, we could cut regular updates in half, even
when we allow only one WARM update per chain (the chain really has a single
tuple for this discussion). IOW approximately half the updates insert a new
index entry in *every* index, and the other half insert a new index entry
*only* in the affected index. That by itself does a good bit for
performance.

So to answer your question: yes, joining two HOT chains via WARM is much
cheaper because it results in creating new index entries just for affected
indexes.
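
As a back-of-envelope illustration of that accounting (entirely my own
sketch, not from the patch): with several indexes and at most one WARM
update per chain, updates alternate between a WARM update (one new index
entry) and a regular update (one new entry per index).

```c
#include <assert.h>
#include <stdbool.h>

/* Rough accounting of new index entries over a stream of non-HOT updates,
 * assuming updates alternate between WARM-eligible and regular because at
 * most one WARM update is allowed per chain. */
static long
index_entries(long n_updates, int n_indexes, bool warm_enabled)
{
    long total = 0;
    bool warm_eligible = true;

    for (long i = 0; i < n_updates; i++)
    {
        if (warm_enabled && warm_eligible)
            total += 1;           /* WARM: only the affected index */
        else
            total += n_indexes;   /* regular: every index gets an entry */
        warm_eligible = !warm_eligible;
    }
    return total;
}
```

For 1000 updates over 5 indexes this gives 5000 entries without WARM and
3000 with it, which is the flavor of saving described above.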

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-02-24 Thread Pavan Deolasee
On Fri, Feb 24, 2017 at 2:13 PM, Robert Haas  wrote:

> On Thu, Feb 23, 2017 at 9:21 PM, Bruce Momjian  wrote:
> >
>
> > I have what might be a stupid question.  As I remember, WARM only allows
> > a single index-column change in the chain.  Why are you seeing such a
> > large performance improvement?  I would have thought it would be that
> > high if we allowed an unlimited number of index changes in the chain.
>
> I'm not sure how the test case is set up.  If the table has multiple
> indexes, each on a different column, and only one of the indexes is
> updated, then you figure to win because now the other indexes need
> less maintenance (and get less bloated).  If you have only a single
> index, then I don't see how WARM can be any better than HOT, but maybe
> I just don't understand the situation.
>
>
That's correct. If you have just one index and the UPDATE modifies the
indexed column, the UPDATE won't be a WARM update and the patch gives you
no benefit. OTOH if the UPDATE doesn't modify any indexed columns, then it
will be a HOT update and again the patch gives you no benefit. It might be
worthwhile to see if the patch causes any regression in these scenarios,
though I think it will be minimal or zero.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-02-24 Thread Pavan Deolasee
On Fri, Feb 24, 2017 at 12:28 AM, Alvaro Herrera 
wrote:

> Bruce Momjian wrote:
> > On Thu, Feb 23, 2017 at 03:45:24PM -0300, Alvaro Herrera wrote:
>
> > > > and potentially trim the first HOT chain as those tuples become
> > > > invisible.
> > >
> > > That can already happen even without WARM, no?
> >
> > Uh, the point is that with WARM those four early tuples can be removed
> > via a prune, rather than requiring a VACUUM. Without WARM, the fourth
> > tuple can't be removed until the index is cleared by VACUUM.
>
> I *think* that the WARM-updated one cannot be pruned either, because
> it's pointed to by at least one index (otherwise it'd have been a HOT
> update).  The ones prior to that can be removed either way.
>
>
No, even the WARM-updated tuple can be pruned, and if there are further HOT
updates, those can be pruned too. All indexes, and even multiple pointers
from the same index, always point to the root of the WARM chain, and that
line pointer does not go away unless the entire chain becomes dead. The
only material difference between HOT and WARM is that since there are two
index pointers from the same index to the same root line pointer, we must
do a recheck. But HOT-pruning and all such things remain the same.

Let's take an example. Say, we have a table (a int, b int, c text) and two
indexes on first two columns.

(1, 100, 'foo') --HOT--> (1, 100, 'bar') --WARM--> (1, 200, 'bar') --HOT--> (1, 200, 'foo')

The first update will be a HOT update, the second update will be a WARM
update, and the third update will again be a HOT update. The first and
third updates do not create any new index entry, though the second update
will create a new index entry in the second index. Any further WARM updates
to this chain are not allowed, but further HOT updates are OK.

If all but the last version become DEAD, HOT-prune will remove all of them
and turn the first line pointer into a REDIRECT line pointer. At this
point, the first index has one index pointer and the second index has two
index pointers, but all pointing to the same root line pointer, which has
now become a REDIRECT line pointer.

[redirect] ---> (1, 200, 'foo')

> I think the part you want (be able to prune the WARM updated tuple) is
> part of what Pavan calls "turning the WARM chain into a HOT chain", so
> not part of the initial patch.  Pavan can explain this part better, and
> also set me straight in case I'm wrong in the above :-)
>
>
Umm.. it's a bit different. Without chain conversion, we still don't allow
further WARM updates to the above chain, because that might create a third
index pointer and our recheck logic can't cope with duplicate scans. HOT
updates are allowed though.

The latest patch that I proposed will handle this case and convert such
chains into regular HOT-pruned chains. To do that, we must remove the
duplicate (and now wrong) index pointer to the chain. Once we do that and
change the state on the heap tuple, we can once again do WARM update to
this tuple. Note that in this example the chain has just one tuple, which
will be the case typically, but the algorithm can deal with the case where
there are multiple tuples but with matching index keys.

Hope this helps.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-02-24 Thread Pavan Deolasee
On Fri, Feb 24, 2017 at 3:23 PM, Robert Haas  wrote:

>
> I don't immediately see how this will work with index-only scans.  If
> the tuple is HOT updated several times, HOT-pruned back to a single
> version, and then the page is all-visible, the index entries are
> guaranteed to agree with the remaining tuple, so it's fine to believe
> the data in the index tuple.  But with WARM, that would no longer be
> true, unless you have some trick for that...
>
>
Well, the trick is to not allow index-only scans on such pages by not
marking them all-visible. That's why, when a tuple is WARM updated, we
carry that information in the subsequent versions even when later updates
are HOT updates. The chain conversion algorithm will handle this by
clearing those bits and thus allowing index-only scans again.
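
In other words, the all-visible decision gains one extra condition. A
minimal sketch of that check (my own illustration; the flag bookkeeping is
heavily simplified compared to the real visibility machinery):

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch: to keep index-only scans correct, a page is eligible for the
 * all-visible bit only if every tuple is visible to all AND no tuple still
 * carries the WARM flag (i.e. any WARM chain has been converted back). */
static bool
page_all_visible_for_ios(const bool *tuple_visible,
                         const bool *tuple_warm,
                         int ntuples)
{
    for (int i = 0; i < ntuples; i++)
    {
        if (!tuple_visible[i])
            return false;       /* normal all-visible rule */
        if (tuple_warm[i])
            return false;       /* unconverted WARM chain: IOS unsafe */
    }
    return true;
}
```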

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-02-24 Thread Pavan Deolasee
On Fri, Feb 24, 2017 at 3:42 PM, Robert Haas  wrote:

>
>
> Wow, OK.  In my view, that makes the chain conversion code pretty much
> essential, because if you had WARM without chain conversion then the
> visibility map gets more or less irrevocably less effective over time,
> which sounds terrible.


Yes. I decided to complete the chain conversion patch when I realised that
IOS would otherwise become completely useless if a large percentage of rows
are updated just once. So I agree: it's not an optional patch and should go
in with the main WARM patch.


> But it sounds to me like even with the chain
> conversion, it might take multiple vacuum passes before all visibility
> map bits are set, which isn't such a great property (thus e.g.
> fdf9e21196a6f58c6021c967dc5776a16190f295).
>
>
The chain conversion algorithm first converts the chains during vacuum and
then checks if the page can be set all-visible. So I'm not sure why it
would take multiple vacuums before a page is set all-visible. The commit
you quote was written to ensure that we make another attempt to set the
page all-visible after all dead tuples are removed from the page.
Similarly, we will convert all WARM chains to HOT chains and then check for
all-visibility of the page.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-02-24 Thread Pavan Deolasee
On Fri, Feb 24, 2017 at 9:47 PM, Robert Haas  wrote:

>
> And there are no bugs, right?  :-)


Yeah yeah absolutely nothing. Just like any other feature committed to
Postgres so far ;-)

I need to polish the chain conversion patch a bit and also add missing
support for redo, hash indexes etc. Support for hash indexes will need
overloading of the ip_posid bits in the index tuple (since there are no
free bits left in hash tuples). I plan to work on that next and submit a
fully functional patch, hopefully before the commit-fest starts.

(I have mentioned the idea of overloading ip_posid bits a few times now and
haven't heard any objection so far. Well, that could either mean that
nobody has read those emails seriously or that there is general acceptance
of the idea... I am assuming the latter :-))

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


[HACKERS] Skip all-visible pages during second HeapScan of CIC

2017-02-28 Thread Pavan Deolasee
Hello All,

During the second heap scan of CREATE INDEX CONCURRENTLY, we're only
interested in the tuples which were inserted after the first scan was
started. All such tuples can only exist in pages which have their VM bit
unset. So I propose the attached patch, which consults the VM during the
second scan and skips all-visible pages. We use the same trick of skipping
pages only if a certain threshold of pages can be skipped, to ensure the
OS's read-ahead is not disturbed.
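
For illustration, the skipping heuristic could look roughly like this (the
threshold constant, function name, and interface are my own sketch, not the
patch's actual code):

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed threshold: skip a run of all-visible pages only when it is long
 * enough that skipping won't defeat the OS read-ahead (the actual value in
 * the patch may differ). */
#define CIC_SKIP_PAGES_THRESHOLD 32

/* Return the next page the second CIC heap scan must read, starting the
 * search at 'page'; returns -1 when the scan is done. */
static long
next_page_to_scan(const bool *all_visible, long npages, long page)
{
    while (page < npages)
    {
        long run = 0;

        while (page + run < npages && all_visible[page + run])
            run++;
        if (run >= CIC_SKIP_PAGES_THRESHOLD)
        {
            page += run;        /* long all-visible run: skip it wholesale */
            continue;
        }
        return page;            /* short run (or page not all-visible) */
    }
    return -1;
}
```

A run of 40 all-visible pages would be skipped entirely, while a run of 5
would still be scanned so sequential read-ahead stays intact.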

The patch obviously shows a significant reduction in time for building an
index concurrently on very large tables which are not being updated
frequently and which were vacuumed recently (so that VM bits are set). I
can post performance numbers if there is interest. For tables that are
being updated heavily, the threshold-based skipping was indeed useful;
without it we saw a slight regression.

Since VM bits are only set during VACUUM which conflicts with CIC on the
relation lock, I don't see any risk of incorrectly skipping pages that the
second scan should have scanned.

Comments?

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


cic_skip_all_visible_v3.patch
Description: Binary data



Re: [HACKERS] Cleanup: avoid direct use of ip_posid/ip_blkid

2017-03-05 Thread Pavan Deolasee
On Thu, Mar 2, 2017 at 9:55 PM, Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

> On 2/22/17 08:38, Pavan Deolasee wrote:
> > One reason why these macros are not always used is because they
> > typically do assert-validation to ensure ip_posid has a valid value.
> > There are a few places in code, especially in GIN and also when we are
> > dealing with user-supplied TIDs when we might get a TID with invalid
> > ip_posid. I've handled that by defining and using separate macros which
> > skip the validation. This doesn't seem any worse than what we are
> > already doing.
>
> I wonder why we allow that.  Shouldn't the tid type reject input that
> has ip_posid == 0?
>

Yes, I think it seems sensible to disallow InvalidOffsetNumber (or anything
> MaxOffsetNumber) in user-supplied values. But there are places in GIN,
and with INSERT ON CONFLICT, where we seem to use special values in
ip_posid to mean different things. So we might still need some way to
accept invalid values there.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Skip all-visible pages during second HeapScan of CIC

2017-03-07 Thread Pavan Deolasee
On Wed, Mar 8, 2017 at 7:33 AM, Robert Haas  wrote:

> On Tue, Mar 7, 2017 at 4:26 PM, Stephen Frost  wrote:
> > Right, that's what I thought he was getting at and my general thinking
> > was that we would need a way to discover if a CIC is ongoing on the
> > relation and therefore heap_page_prune(), or anything else, would know
> > that it can't twiddle the bits in the VM due to the ongoing CIC.
> > Perhaps a lock isn't the right answer there, but it would have to be
> > some kind of cross-process communication that operates at a relation
> > level..
>
> Well, I guess that's one option.  I lean toward the position already
> taken by Andres and Peter, namely, that it's probably not a great idea
> to pursue this optimization.


Fair point. I'm not going to "persist" with the idea for too long. It
seemed like a good, low-risk feature to me which can benefit certain use
cases quite reasonably. It's not uncommon to create indexes (or reindex
existing indexes to remove index bloat) on extremely large tables, and
avoiding a second heap scan can hugely benefit such cases. Holding up the
patch for something for which we don't even have a proposal yet seemed a
bit strange at first, but I see the point.

Anyways, for a recently vacuumed table twice the size of RAM, on a machine
with SSDs, the patched CIC version runs about 25% faster. That's probably
the best-case scenario.

I also realised that I'd attached a slightly older version of the patch. A
new version is attached, but I'll remove the patch from the CF if I don't
see any more votes in favour of it.

Thanks,
Pavan


cic_skip_all_visible_v4.patch
Description: Binary data



Re: [HACKERS] Skip all-visible pages during second HeapScan of CIC

2017-03-07 Thread Pavan Deolasee
On Wed, Mar 8, 2017 at 8:08 AM, Robert Haas  wrote:

> On Tue, Mar 7, 2017 at 9:12 PM, Andres Freund  wrote:
>
> >
> > I wonder however, if careful snapshot managment couldn't solve this as
> > well.  I have *not* thought a lot about this, but afaics we can easily
> > prevent all-visible from being set in cases it'd bother us by having an
> > "appropriate" xmin / registered snapshot.
>
> Yeah, but that's a tax on the whole system.
>
>
Maybe we can do that only when the patched CIC (or any other operation
which may not like concurrent VM-bit setting) is in progress. We could use
what Stephen suggested upthread to discover that state. But right now it's
hard to reason about, because there is nothing implemented on either side,
so we don't know what gets impacted by aggressively setting VM bits, and
how.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-03-14 Thread Pavan Deolasee
On Tue, Mar 14, 2017 at 7:17 AM, Alvaro Herrera 
wrote:

> > @@ -234,6 +236,21 @@ index_beginscan(Relation heapRelation,
> >   scan->heapRelation = heapRelation;
> >   scan->xs_snapshot = snapshot;
> >
> > + /*
> > +  * If the index supports recheck, make sure that index tuple is
> saved
> > +  * during index scans.
> > +  *
> > +  * XXX Ideally, we should look at all indexes on the table and
> check if
> > +  * WARM is at all supported on the base table. If WARM is not
> supported
> > +  * then we don't need to do any recheck.
> RelationGetIndexAttrBitmap() does
> > +  * do that and sets rd_supportswarm after looking at all indexes.
> But we
> > +  * don't know if the function was called earlier in the session
> when we're
> > +  * here. We can't call it now because there exists a risk of
> causing
> > +  * deadlock.
> > +  */
> > + if (indexRelation->rd_amroutine->amrecheck)
> > + scan->xs_want_itup = true;
> > +
> >   return scan;
> >  }
>
> I didn't like this comment very much.  But it's not necessary: you have
> already given relcache responsibility for setting rd_supportswarm.  The
> only problem seems to be that you set it in RelationGetIndexAttrBitmap
> instead of RelationGetIndexList, but it's not clear to me why.


Hmm. I think you're right. Will fix that way and test.


>
> I noticed that nbtinsert.c and nbtree.c have a bunch of new includes
> that they don't actually need.  Let's remove those.  nbtutils.c does
> need them because of btrecheck().


Right. It's probably a leftover from the way I wrote the first version.
Will fix.

> Speaking of which:
>
> I have already commented about the executor involvement in btrecheck();
> that doesn't seem good.  I previously suggested to pass the EState down
> from caller, but that's not a great idea either since you still need to
> do the actual FormIndexDatum.  I now think that a workable option would
> be to compute the values/isnulls arrays so that btrecheck gets them
> already computed.


I agree with your complaint about the modularity violation. What I am
unclear about is how passing the values/isnulls arrays will fix that. The
way the code is structured currently, recheck routines are called by
index_fetch_heap(). So if we try to compute values/isnulls in that
function, we'll still need access to the EState, which AFAIU will lead to a
similar violation. Or am I mis-reading your idea?

I wonder if we should instead invent something similar to IndexRecheck(),
but instead of running ExecQual(), this new routine would compare the index
values from the given HeapTuple against the given IndexTuple. ISTM that for
this to work we'll need to modify all callers of index_getnext() and teach
them to invoke the AM-specific recheck method if the xs_tuple_recheck flag
is set to true by index_getnext().

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-03-14 Thread Pavan Deolasee
On Tue, Mar 14, 2017 at 7:19 PM, Alvaro Herrera 
wrote:

> Pavan Deolasee wrote:
>
> > BTW I wanted to share some more numbers from a recent performance test. I
> > thought it's important because the latest patch has fully functional
> chain
> > conversion code as well as all WAL-logging related pieces are in place
> > too. I ran these tests on a box borrowed from Tomas (thanks!).  This has
> > 64GB RAM and 350GB SSD with 1GB on-board RAM. I used the same test setup
> > that I used for the first test results reported on this thread i.e. a
> > modified pgbench_accounts table with additional columns and additional
> > indexes (one index on abalance so that every UPDATE is a potential WARM
> > update).
> >
> > In a test where table + indexes exceeds RAM, running for 8hrs and
> > auto-vacuum parameters set such that we get 2-3 autovacuums on the table
> > during the test, we see WARM delivering more than 100% TPS as compared to
> > master. In this graph, I've plotted a moving average of TPS and the
> spikes
> > that we see coincides with the checkpoints (checkpoint_timeout is set to
> > 20mins and max_wal_size large enough to avoid any xlog-based
> checkpoints).
> > The spikes are more prominent on WARM but I guess that's purely because
> it
> > delivers much higher TPS. I haven't shown here but I see WARM updates
> close
> > to 65-70% of the total updates. Also there is significant reduction in
> WAL
> > generated per txn.
>
> Impressive results.  Labels on axes would improve readability of the chart
> :-)
>
>
Sorry about that. I was desperately searching for an Undo button after
hitting "send" for the very same reason :-) It looks like I last used
gnuplot a few years ago.

Just to make it clear, the X-axis is the duration of the test in seconds
and the Y-axis is a 450s moving average of TPS. BTW 450 is no magic figure:
I collected stats every 15s and took a moving average of the last 30
samples.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-03-15 Thread Pavan Deolasee
On Tue, Mar 14, 2017 at 7:16 PM, Alvaro Herrera 
wrote:

> Pavan Deolasee wrote:
> > On Tue, Mar 14, 2017 at 7:17 AM, Alvaro Herrera <
> alvhe...@2ndquadrant.com>
> > wrote:
>
> > > I have already commented about the executor involvement in btrecheck();
> > > that doesn't seem good.  I previously suggested to pass the EState down
> > > from caller, but that's not a great idea either since you still need to
> > > do the actual FormIndexDatum.  I now think that a workable option would
> > > be to compute the values/isnulls arrays so that btrecheck gets them
> > > already computed.
> >
> > I agree with your complaint about modularity violation. What I am unclear
> > is how passing values/isnulls array will fix that. The way code is
> > structured currently, recheck routines are called by index_fetch_heap().
> So
> > if we try to compute values/isnulls in that function, we'll still need
> > access EState, which AFAIU will lead to similar violation. Or am I
> > mis-reading your idea?
>
> You're right, it's still a problem.  (Honestly, I think the whole idea
> of trying to compute a fake index tuple starting from a just-read heap
> tuple is a problem in itself;


Why do you think so?


> I just wonder if there's a way to do the
> recheck that doesn't involve such a thing.)
>

I couldn't find a better way without a lot of complex infrastructure. Even
though we now have the ability to mark index pointers, and we know that a
given pointer points either to the pre-WARM chain or the post-WARM chain,
this does not solve the case when an index does not receive a new entry. In
that case, both pre-WARM and post-WARM tuples are reachable via the same
old index pointer. The only way we could deal with this is to mark index
pointers as "common", "pre-warm" and "post-warm". But that would require us
to update the old pointer's state from "common" to "pre-warm" for the index
whose keys are being updated. Maybe it's doable, but it might be more
complex than the current approach.


>
> > I wonder if we should instead invent something similar to IndexRecheck(),
> > but instead of running ExecQual(), this new routine will compare the
> index
> > values by the given HeapTuple against given IndexTuple. ISTM that for
> this
> > to work we'll need to modify all callers of index_getnext() and teach
> them
> > to invoke the AM specific recheck method if xs_tuple_recheck flag is set
> to
> > true by index_getnext().
>
> Yeah, grumble, that idea does sound intrusive, but perhaps it's
> workable.  What about bitmap indexscans?  AFAICS we already have a
> recheck there natively, so we only need to mark the page as lossy, which
> we're already doing anyway.
>

Yeah, bitmap index scans should be OK. We need the recheck logic only to
avoid duplicate results, and since a TID can occur only once in the bitmap,
there is no such risk there.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-03-19 Thread Pavan Deolasee
On Thu, Mar 16, 2017 at 12:53 PM, Robert Haas  wrote:

> On Wed, Mar 15, 2017 at 3:44 PM, Pavan Deolasee
>  wrote:
> > I couldn't find a better way without a lot of complex infrastructure.
> Even
> > though we now have ability to mark index pointers and we know that a
> given
> > pointer either points to the pre-WARM chain or post-WARM chain, this does
> > not solve the case when an index does not receive a new entry. In that
> case,
> > both pre-WARM and post-WARM tuples are reachable via the same old index
> > pointer. The only way we could deal with this is to mark index pointers
> as
> > "common", "pre-warm" and "post-warm". But that would require us to update
> > the old pointer's state from "common" to "pre-warm" for the index whose
> keys
> > are being updated. May be it's doable, but might be more complex than the
> > current approach.
>
> /me scratches head.
>
> Aren't pre-warm and post-warm just (better) names for blue and red?
>
>
Yeah, sounds better. Just to make it clear, the current design sets the
following information:

HEAP_WARM_TUPLE - When a row gets WARM updated, both the old and new
versions of the row are marked with the HEAP_WARM_TUPLE flag. This allows
us to remember that a certain row was WARM-updated, even if the update
later aborts and we clean up the new version and truncate the chain. All
subsequent tuple versions will carry this flag until a non-HOT update
happens, which breaks the HOT chain.

HEAP_WARM_RED - After the first WARM update, the new version of the tuple
is marked with this flag. This flag is also carried forward to all future
HOT-updated tuples. So the only tuple that has HEAP_WARM_TUPLE but not
HEAP_WARM_RED is the old version before the WARM update. Also, all tuples
marked with the HEAP_WARM_RED flag satisfy the HOT property (i.e. all index
key columns share the same value). Similarly, all tuples NOT marked with
HEAP_WARM_RED also satisfy the HOT property. I've so far called them the
Red and Blue chains respectively.

In addition, in the current patch, the new index pointers resulted from
WARM updates are marked BTREE_INDEX_RED_POINTER/HASH_INDEX_RED_POINTER

I think per your suggestion we can change HEAP_WARM_RED to HEAP_WARM_TUPLE
and similarly rename the index pointers to BTREE/HASH_INDEX_WARM_POINTER
and replace HEAP_WARM_TUPLE with something like HEAP_WARM_UPDATED_TUPLE to
signify that this or some previous version of this chain was once
WARM-updated.
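
If it helps, here is a rough sketch of the flag semantics under the
proposed renaming (the bit values and the helper are my own illustration,
not the actual infomask assignments from the patch):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit values -- not the patch's real infomask bits */
#define HEAP_WARM_UPDATED_TUPLE 0x0001  /* chain was WARM-updated at some point */
#define HEAP_WARM_TUPLE         0x0002  /* tuple is on the post-WARM ("red") side */

/* Under the proposed naming, the version just before the WARM update is
 * the only tuple carrying the "updated" flag without the "warm" flag. */
static bool
is_pre_warm_version(uint16_t infomask)
{
    return (infomask & HEAP_WARM_UPDATED_TUPLE) != 0 &&
           (infomask & HEAP_WARM_TUPLE) == 0;
}
```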

Does that sound ok? I can change the patch accordingly.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-03-20 Thread Pavan Deolasee
On Mon, Mar 20, 2017 at 8:11 PM, Robert Haas  wrote:

> On Sun, Mar 19, 2017 at 3:05 AM, Pavan Deolasee
>  wrote:
> > On Thu, Mar 16, 2017 at 12:53 PM, Robert Haas 
> wrote:
>
> >>
> >> /me scratches head.
> >>
> >> Aren't pre-warm and post-warm just (better) names for blue and red?
> >>
> >
> > Yeah, sounds better.
>
> My point here wasn't really about renaming, although I do think
> renaming is something that should get done.  My point was that you
> were saying we need to mark index pointers as common, pre-warm, and
> post-warm.  But you're pretty much already doing that, I think.  I
> guess you don't have "common", but you do have "pre-warm" and
> "post-warm".
>
>
Ah, I mis-read that. Strictly speaking, we already have common (blue) and
post-warm (red), and I just finished renaming them to CLEAR (of the WARM
bit) and WARM. Maybe it's still not the best naming, but I think it looks
better than before.

But the larger point is that we don't have an easy way to know whether an
index pointer which was inserted with the original heap tuple (i.e.
pre-WARM update) should return only pre-WARM tuples or should also return
post-WARM tuples. Right now we make that decision by looking at the index
keys and discarding any pointer whose index key does not match the one
constructed from the heap keys. If we need to change that, then on every
WARM update we will have to go back to the original pointer and change its
state to pre-warm. That looks more invasive and requires additional index
management.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-03-20 Thread Pavan Deolasee
On Wed, Mar 15, 2017 at 12:46 AM, Alvaro Herrera 
wrote:

> Pavan Deolasee wrote:
> > On Tue, Mar 14, 2017 at 7:17 AM, Alvaro Herrera <
> alvhe...@2ndquadrant.com>
> > wrote:
>
> > > I have already commented about the executor involvement in btrecheck();
> > > that doesn't seem good.  I previously suggested to pass the EState down
> > > from caller, but that's not a great idea either since you still need to
> > > do the actual FormIndexDatum.  I now think that a workable option would
> > > be to compute the values/isnulls arrays so that btrecheck gets them
> > > already computed.
> >
> > I agree with your complaint about the modularity violation. What I am
> > unclear about is how passing the values/isnulls array will fix that.
> > The way the code is structured currently, recheck routines are called
> > by index_fetch_heap(). So if we try to compute values/isnulls in that
> > function, we'll still need access to the EState, which AFAIU will lead
> > to a similar violation. Or am I mis-reading your idea?
>
> You're right, it's still a problem.



BTW I realised that we don't really need those executor bits in the
recheck routines. We don't support WARM when attributes in index
expressions are modified, so we really don't need to do any comparison for
those attributes. I've written a separate form of FormIndexDatum() which
will only return basic index attributes, and comparing them should be
enough. Will share a rebased and updated patch soon.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-03-20 Thread Pavan Deolasee
On Tue, Mar 14, 2017 at 7:17 AM, Alvaro Herrera 
wrote:

> > @@ -234,6 +236,21 @@ index_beginscan(Relation heapRelation,
> >   scan->heapRelation = heapRelation;
> >   scan->xs_snapshot = snapshot;
> >
> > + /*
> > +  * If the index supports recheck, make sure that index tuple is
> saved
> > +  * during index scans.
> > +  *
> > +  * XXX Ideally, we should look at all indexes on the table and
> check if
> > +  * WARM is at all supported on the base table. If WARM is not
> supported
> > +  * then we don't need to do any recheck.
> RelationGetIndexAttrBitmap() does
> > +  * do that and sets rd_supportswarm after looking at all indexes.
> But we
> > +  * don't know if the function was called earlier in the session
> when we're
> > +  * here. We can't call it now because there exists a risk of
> causing
> > +  * deadlock.
> > +  */
> > + if (indexRelation->rd_amroutine->amrecheck)
> > + scan->xs_want_itup = true;
> > +
> >   return scan;
> >  }
>
> I didn't like this comment very much.  But it's not necessary: you have
> already given relcache responsibility for setting rd_supportswarm.  The
> only problem seems to be that you set it in RelationGetIndexAttrBitmap
> instead of RelationGetIndexList, but it's not clear to me why.  I think
> if the latter function is in charge, then we can trust the flag more
> than the current situation.


I looked at this today.  AFAICS we don't have access to rd_amroutine in
RelationGetIndexList since we don't actually call index_open() in that
function. Would it be safe to do that? I'll give it a shot, but thought of
asking here first.

Thanks,
Pavan


Re: [HACKERS] Compression of full-page-writes

2014-06-11 Thread Pavan Deolasee
On Wed, Jun 11, 2014 at 4:19 PM, Fujii Masao  wrote:

>
> IIUC even when we adopt only one algorithm, additional at least one bit is
> necessary to see whether this backup block is compressed or not.
>
> This flag is necessary only for backup block, so there is no need to use
> the header of each WAL record. What about just using the backup block
> header?
>
>
+1. We can also steal a few bits from the ForkNumber field in the backup
block header if required.
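As a rough sketch of what "stealing" those bits could look like — all macro and function names here are hypothetical, and the real BkpBlock stores the fork as a full ForkNumber — fork numbers only need a couple of bits, so a compression flag could ride in the top bit of a one-byte field:

```c
#include <stdint.h>

/*
 * Hypothetical encoding: fork numbers are tiny (0..3 in stock Postgres),
 * so the top bit of a one-byte fork field is free to flag compression.
 */
#define BKPBLOCK_FORK_MASK   0x7F
#define BKPBLOCK_COMPRESSED  0x80

static inline uint8_t
bkp_encode_fork(uint8_t fork, int compressed)
{
    return (uint8_t) ((fork & BKPBLOCK_FORK_MASK) |
                      (compressed ? BKPBLOCK_COMPRESSED : 0));
}

static inline uint8_t
bkp_decode_fork(uint8_t field)
{
    return field & BKPBLOCK_FORK_MASK;
}

static inline int
bkp_is_compressed(uint8_t field)
{
    return (field & BKPBLOCK_COMPRESSED) != 0;
}
```

The restore path would mask the flag off before using the fork number, so existing consumers of the field stay unaware of the change.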

Thanks,
Pavan

-- 
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee


Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-06-18 Thread Pavan Deolasee
On Wed, Jun 18, 2014 at 6:25 PM, Abhijit Menon-Sen 
wrote:

> At 2014-06-18 18:10:34 +0530, rahilasye...@gmail.com wrote:
> >
> > palloc() is disallowed in critical sections and we are already in CS
> > while executing this code. So we use malloc().
>
> Are these allocations actually inside a critical section? It seems to me
> that the critical section starts further down, but perhaps I am missing
> something.
>
>
ISTM XLogInsert() itself is called from other critical sections. See
heapam.c for example.


> Second, as Andres says, you shouldn't malloc() inside a critical section
> either; and anyway, certainly not without checking the return value.
>
>
I was actually surprised to see Andres's comment. But he is right: OOM
inside a CS will result in a PANIC. I wonder if we can, or really do,
enforce that though. The code within #ifdef WAL_DEBUG in the same function
is surely doing a palloc(). That will be caught, since there is an assert
inside palloc(). Maybe nobody has tried building with WAL_DEBUG since that
assert was added.

Maybe Rahila can move that code to InitXLogAccess(), or better yet, check
the malloc() return value and proceed without compression if it fails.
There is code in snappy.c which will need similar handling, if we
eventually decide to add that to core.
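A minimal sketch of that fallback, with hypothetical names: the scratch buffer is allocated once outside any critical section, and a NULL buffer simply means "write the page uncompressed" rather than risking an allocation failure inside a CS. The compression call itself is elided here.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical sketch: grab the compression scratch buffer once, outside
 * any critical section (e.g. from InitXLogAccess()). If allocation fails,
 * compression is simply disabled for this backend.
 */
static char *compression_scratch = NULL;

static void
xlog_compression_init(size_t bufsize)
{
    compression_scratch = malloc(bufsize);
    /* NULL is acceptable: caller falls back to the uncompressed path */
}

static int
xlog_try_compress(const char *page, size_t len, size_t *compressed_len)
{
    if (compression_scratch == NULL)
        return 0;               /* no buffer: write the page as-is */

    /* ... real code would run pglz/snappy into compression_scratch ... */
    memcpy(compression_scratch, page, len);
    *compressed_len = len;      /* placeholder: real code sets true size */
    return 1;
}
```

The key property is that nothing on this path can fail inside the critical section: the only allocation happens up front, and failure degrades to uncompressed output instead of a PANIC.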

> I am not sure if the change will be a significant improvement from
> > performance point of view except it will save few condition checks.
>
> Moving that allocation out of the outer for loop it's currently in is
> *nothing* to do with performance, but about making the code easier to
> read.
>
>
+1.

Thanks,
Pavan

-- 
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee


Re: [HACKERS] Add a filed to PageHeaderData

2014-06-24 Thread Pavan Deolasee
On Tue, Jun 24, 2014 at 2:28 PM, Greg Stark  wrote:

> On Tue, Jun 24, 2014 at 12:02 AM, Soroosh Sardari
>  wrote:
> > Is there any rule for adding a field to PageHeaderData?
>
> Not really. It's a pretty internal thing, not something we expect
> people to be doing all the time.
>
> The only rule I can think of is that you should bump some version
> numbers such as the page format version and probably the catalog
> version. But that's probably irrelevant to your problem. It sounds
> like you have a bug in your code but you haven't posted enough
> information to say much more.
>
>
Out of curiosity, I actually tried adding a char[20] field to the page
header, because just like you I thought this should be completely
internal, as long as the field is added before the pd_linp[] field. But I
get the same failure the OP is reporting. I wonder if it's a bug in the
gist index build, though I could not spot anything at first glance. FWIW,
changing the char[] from 20 to 22 or 24 does not cause any failure in the
rangetypes test. So I am thinking it's some alignment issue (mine is a
64-bit build).

Thanks,
Pavan
-- 
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee


Re: [HACKERS] Add a filed to PageHeaderData

2014-06-24 Thread Pavan Deolasee
On Tue, Jun 24, 2014 at 3:40 PM, Kevin Grittner  wrote:
>
> Soroosh Sardari  wrote:
>
> > I check this problem with a virgin source code of
> > postgresql-9.3.2. So the bug is not for my codes.
>
> > By the way, following code has two different output and it is
> > weird.
>
> I can confirm that I see the difference in 9.3.2, and that I don't
> see the difference in 9.3.4.  Upgrade.
>
> http://www.postgresql.org/support/versioning/
>
> There's really no point in reporting a possible bug on a version
> with known bugs which have already had fixes published.
>

FWIW I can reproduce this on HEAD with the attached patch. I could
reproduce it on 64-bit Ubuntu as well as 64-bit Mac OSX. It is very
confusing because I tried various values of N in the char[N] array and it
fails for N=20. The other values I tried are 4, 12, 22 and 24, and the
test passes for all of them. The logic for trying other values was to see
if pd_linp[] starting on an unaligned boundary can trigger the issue, but
there seems to be no correlation.
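The offset arithmetic being checked here can be illustrated with a mock layout (not the real PageHeaderData). With a char[20] field the line-pointer array lands 4 bytes short of an 8-byte MAXALIGN boundary, while char[24] restores alignment — though, consistent with the observation above, char[4] is equally unaligned yet passes the test, so alignment alone does not explain the failures.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Mock of the PageHeaderData layout (not the real struct) to show how an
 * extra char[N] field shifts the offset of the line-pointer array.
 */
typedef struct
{
    uint64_t pd_lsn;            /* stands in for PageXLogRecPtr */
    uint16_t pd_checksum;
    uint16_t pd_flags;
    uint16_t pd_lower;
    uint16_t pd_upper;
    uint16_t pd_special;
    uint16_t pd_pagesize_version;
    uint32_t pd_prune_xid;
    char     pd_padding[20];    /* the experimental extra field */
    uint16_t pd_linp[1];        /* stands in for ItemIdData array */
} MockPageHeader;
```

Here offsetof(MockPageHeader, pd_linp) is 44, which is not a multiple of 8; bumping the padding to 24 bytes would move it to 48 and restore MAXALIGN.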

postgres=# select version();

PostgreSQL 9.5devel on x86_64-apple-darwin13.2.0, compiled by Apple LLVM
version 5.1 (clang-503.0.38) (based on LLVM 3.4svn), 64-bit

postgres=# -- test SP-GiST index that's been built incrementally

postgres=# create table test_range_spgist(ir int4range);
postgres=# create index test_range_spgist_idx on test_range_spgist using
spgist (ir);
postgres=# insert into test_range_spgist select int4range(g, g+10) from
generate_series(1,586) g;
INSERT 0 586

postgres=# SET enable_seqscan= t;
postgres=# SET enable_indexscan  = f;
postgres=# SET enable_bitmapscan = f;

postgres=# select * from test_range_spgist where ir -|- int4range(100,500);
ir
---
[90,100)
[500,510)
(2 rows)

postgres=# SET enable_seqscan= f;
postgres=# select * from test_range_spgist where ir -|- int4range(100,500);
ir
---
 [90,100)
 [500,510)
(2 rows)

At this point, both rows are visible via index scan as well as seq scan.

postgres=# insert into test_range_spgist select int4range(g, g+10) from
generate_series(587,587) g;
INSERT 0 1

postgres=# select * from test_range_spgist where ir -|- int4range(100,500);
ir
--
 [90,100)
(1 row)

Ouch. The second row somehow disappeared.

postgres=# SET enable_seqscan= t;
postgres=# select * from test_range_spgist where ir -|- int4range(100,500);
ir
---
 [90,100)
 [500,510)
(2 rows)

So the last INSERT suddenly makes one row disappear from the index scan,
though it's still reachable via seq scan. I tried looking at the SP-GiST
code, but clearly I don't understand it well enough to figure out the
issue, if one exists.

Thanks,
Pavan

--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee


page-header-padding.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Bug in spg_range_quad_inner_consistent for adjacent operator (was Re: [HACKERS] Add a filed to PageHeaderData)

2014-07-01 Thread Pavan Deolasee
On Wed, Jun 25, 2014 at 10:39 PM, Heikki Linnakangas <
hlinnakan...@vmware.com> wrote:

>
> I came up with the attached. There were several bugs:
>
>
I tested the original bug report and the patch definitely fixes it. I
don't feel qualified enough with SP-GiST to really comment on the other
bugs you reported and presumably fixed in the patch.

Thanks,
Pavan


Re: Bug in spg_range_quad_inner_consistent for adjacent operator (was Re: [HACKERS] Add a filed to PageHeaderData)

2014-07-15 Thread Pavan Deolasee
On Wed, Jul 2, 2014 at 11:11 AM, Pavan Deolasee 
wrote:

> On Wed, Jun 25, 2014 at 10:39 PM, Heikki Linnakangas <
> hlinnakan...@vmware.com> wrote:
>
>>
>> I came up with the attached. There were several bugs:
>>
>>
> I tested the original bug report and the patch definitely fixes it. I
> don't feel qualified enough with SP-GiST to really comment on the other
> bugs you reported and presumably fixed in the patch.
>
>
Heikki, did you get a chance to commit your patch? IMHO we should get the
bug fix in before the minor releases next week. My apologies if you've
already committed it and I've missed the commit message.

Thanks,
Pavan

-- 
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee


Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-07-23 Thread Pavan Deolasee
I'm trying to understand what it would take to have this patch in an
acceptable form before the next commitfest. Both Abhijit and Andres have
done extensive review of the patch and have given many useful suggestions
to Rahila. While she has incorporated most of them, I feel we are still
some distance away from having something which can be committed. Here are
my observations based on the discussion on this thread so far.

1. Need for compressing full page backups:
There are a good number of benchmarks done by various people on this list
which clearly show the need for the feature. Many people have already
voiced their agreement on having this in core, even as a configurable
parameter. There have been some requests for more benchmarks, such as
response times immediately after a checkpoint or CPU consumption, which
I'm not entirely sure have been done already.

2. Need for different compression algorithms:
There were requests for comparing different compression algorithms such as
LZ4 and snappy. Based on the numbers that Rahila has posted, I can see LZ4
has the best compression ratio, at least for the TPC-C benchmarks she
tried. Having said that, I was hoping to see more numbers in terms of CPU
utilization which would demonstrate the trade-off, if any. Anyway, there
were also apprehensions expressed about whether to have a pluggable
algorithm in the final patch that gets committed. If we do decide to
support more compression algorithms, I like what Andres had done before,
i.e. store the compression algorithm information in the varlena header. So
basically, we should have an abstract API which takes a buffer and the
desired algorithm and returns the compressed data, along with a varlena
header carrying the encoded information. ISTM that the patch Andres had
posted earlier was focused primarily on toast data, but I think we can
make it more generic so that both toast and FPW can use it.
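The abstract API described above might look roughly like this. The 2-bit algorithm id / 30-bit raw-length split is purely an assumption for illustration, loosely mimicking how a varlena-style header could encode both pieces of information in a few bytes:

```c
#include <stdint.h>

/* Hypothetical algorithm identifiers */
typedef enum
{
    COMPRESS_PGLZ = 0,
    COMPRESS_LZ4,
    COMPRESS_SNAPPY
} CompressionAlgo;

/*
 * Hypothetical 4-byte header: top 2 bits carry the algorithm id, the
 * remaining 30 bits carry the uncompressed length.
 */
static uint32_t
compress_header(CompressionAlgo algo, uint32_t rawlen)
{
    return ((uint32_t) algo << 30) | (rawlen & 0x3FFFFFFF);
}

static CompressionAlgo
header_algo(uint32_t hdr)
{
    return (CompressionAlgo) (hdr >> 30);
}

static uint32_t
header_rawlen(uint32_t hdr)
{
    return hdr & 0x3FFFFFFF;
}
```

A full API would pair this header with a `compress(buf, len, algo)` / `decompress(buf)` entry point, so both TOAST and FPW callers could stay agnostic of the algorithm actually used.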

Having said that, IMHO we should go one step at a time. We have been using
pglz to compress toast data for a long time, so we can continue to use the
same for compressing full page images. We can simultaneously work on
adding more algorithms to core and choose the right candidate for
different scenarios, such as toast or FPW, based on test evidence. But
that work can happen independently of this patch.

3. Compressing one block vs all blocks:
Andres suggested that compressing all backup blocks in one go may give us
a better compression ratio. This is worth trying. I'm wondering what would
be the best way to do so with minimal changes to the xlog insertion code.
Today, we add more rdata items for the backup block header(s) and the
backup blocks themselves (if there is a "hole" then 2 per backup block)
beyond what the caller has supplied. If we have to compress all the backup
blocks together, then one approach is to copy the backup block headers and
the blocks to a temp buffer, compress that, and replace the rdata entries
added previously with a single rdata. Is there a better way to handle
multiple blocks in one go?
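The temp-buffer approach could be sketched as follows. `RDataItem` is a hypothetical stand-in for XLogRecData, and the compression call itself is elided; the point is only the gather step that lets one compression call see every backup block at once.

```c
#include <string.h>

/* Hypothetical stand-in for an XLogRecData chain element */
typedef struct
{
    const char *data;
    size_t      len;
} RDataItem;

/*
 * Copy every backup-block header and body into one contiguous scratch
 * buffer so a single compression call can process all blocks together.
 * Returns the number of bytes gathered, or 0 if the scratch buffer is
 * too small (caller falls back to the uncompressed path).
 */
static size_t
gather_backup_blocks(const RDataItem *items, int nitems,
                     char *scratch, size_t scratch_size)
{
    size_t off = 0;

    for (int i = 0; i < nitems; i++)
    {
        if (off + items[i].len > scratch_size)
            return 0;
        memcpy(scratch + off, items[i].data, items[i].len);
        off += items[i].len;
    }
    /* ... compress scratch[0..off) and replace the chain tail with one rdata ... */
    return off;
}
```

After compressing the gathered bytes, the rdata entries past the last "normal" element would be replaced with a single entry pointing at the compressed output.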

We still need a way to tell the restore path that the WAL data is
compressed. One way is to always add a varlena header irrespective of
whether the blocks are compressed or not, but this looks like overkill.
Another way is to add a new field to XLogRecord to record this
information. It looks like we can do this without increasing the size of
the header, since there are 2 bytes of padding after the xl_rmid field.

4. Handling holes in backup blocks:
I think if we address (3) then this can be easily done. Alternatively, we
can also memzero the "hole" and then compress the entire page. The
compression algorithm should handle that well.

Thoughts/comments?

Thanks,
Pavan


Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-08-06 Thread Pavan Deolasee
On Tue, Aug 5, 2014 at 6:25 PM, Fujii Masao  wrote:

>
>
> This gradual approach looks good to me. And, if the additional compression
> algorithm like lz4 is always better than pglz for every scenarios, we can
> just
> change the code so that the additional algorithm is always used. Which
> would
> make the code simpler.
>
>
Right.


> > 3. Compressing one block vs all blocks:
> > Andres suggested that compressing all backup blocks in one go may give us
> > better compression ratio. This is worth trying. I'm wondering what would
> the
> > best way to do so without minimal changes to the xlog insertion code.
> Today,
> > we add more rdata items for backup block header(s) and backup blocks
> > themselves (if there is a "hole" then 2 per backup block) beyond what the
> > caller has supplied. If we have to compress all the backup blocks
> together,
> > then one approach is to copy the backup block headers and the blocks to a
> > temp buffer, compress that and replace the rdata entries added previously
> > with a single rdata.
>
> Basically sounds reasonable. But, how does this logic work if there are
> multiple rdata and only some of them are backup blocks?
>
>
My idea is to just make a pass over the rdata entries past the
rdt_lastnormal element after processing the backup blocks and making
additional entries in the chain. These additional rdata entries correspond
to the backup blocks and their headers. So we can copy the rdata->data of
these elements into a temp buffer and compress the entire thing in one go.
We can then replace the rdata chain past rdt_lastnormal with a single
rdata whose data points to the compressed output. Recovery code just needs
to decompress this data if the record header indicates that the backup
data is compressed. Sure, the exact mechanism to indicate whether the data
is compressed (and by which algorithm) can be worked out.


> If a "hole" is not copied to that temp buffer, ISTM that we should
> change backup block header  so that it contains the info for a
> "hole", e.g., location that a "hole" starts. No?
>
>
AFAICS it's not required if we compress the stream of BkpBlock headers and
the block data. The current mechanism of constructing the additional rdata
chain items takes care of the hole anyway.

Thanks,
Pavan

-- 
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee


Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-03-21 Thread Pavan Deolasee
On Tue, Mar 21, 2017 at 5:34 PM, Robert Haas  wrote:

> On Tue, Mar 21, 2017 at 6:56 AM, Amit Kapila 
> wrote:
> >> Hmm, that test case isn't all that synthetic.  It's just a single
> >> column bulk update, which isn't anything all that crazy, and 5-10%
> >> isn't nothing.
> >>
> >> I'm kinda surprised it made that much difference, though.
> >>
> >
> > I think it is because heap_getattr() is not that cheap.  We have
> > noticed the similar problem during development of scan key push down
> > work [1].
>
> Yeah.  So what's the deal with this?  Is somebody working on figuring
> out a different approach that would reduce this overhead?  Are we
> going to defer WARM to v11?  Or is the intent to just ignore the 5-10%
> slowdown on a single column update and commit everything anyway?


I think I should clarify something. The test case does a single column
update, but it also has very wide columns, has an index on many columns
(and it updates a column early in the list). In addition, in the test
Mithun updated all 10 million rows of the table in a single transaction,
used an UNLOGGED table, and had fsync turned off.

TBH I see many artificial scenarios here. It will be very useful if he can
rerun the query with some of these restrictions lifted. I'm all for
addressing whatever we can, but I am not sure this test demonstrates
real-world usage.

Having said that, maybe we can do a few things to reduce the overhead.

- Check if the page has enough free space to perform a HOT/WARM update. If
not, don't look up all the index keys.
- Pass bitmaps separately for each index and bail out early if we conclude
that neither HOT nor WARM is possible. In this case, since there is just
one index, as soon as we check the second column we know neither HOT nor
WARM is possible and we will return early. It might complicate the API a
lot, but I can give it a shot if that's what is needed to make progress.
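The first idea amounts to an early bailout before any index-column comparison. A sketch with hypothetical stand-ins (a `page_full_hint` flag mimicking PD_PAGE_FULL, and precomputed free space) shows the decision order: page-level checks first, column extraction only if a HOT/WARM update is still possible.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical early-exit check: if the page cannot host the new tuple
 * anyway, skip the (expensive) extraction and comparison of index
 * columns entirely — the update will be a regular cold update.
 */
static bool
hot_update_possible(size_t page_free, size_t newtup_len,
                    bool page_full_hint)
{
    if (page_full_hint)
        return false;       /* a previous attempt already found no room */
    return newtup_len <= page_free;
}
```

Only when this returns true would the caller go on to build the modified-attributes bitmap and compare index keys.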

Any other ideas?

Thanks,
Pavan
-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-03-21 Thread Pavan Deolasee
On Tue, Mar 21, 2017 at 6:55 PM, Robert Haas  wrote:

>
> I think that very wide columns and highly indexed tables are not
> particularly unrealistic, nor do I think updating all the rows is
> particularly unrealistic.


Ok. But those who update 10M rows in a single transaction, would they
really notice a 5-10% variation? I think it probably makes sense to run
those updates in smaller transactions and see if the regression is still
visible (otherwise tweaking synchronous_commit is moot anyway).


> Sure, it's not everything, but it's
> something.  Now, I would agree that all of that PLUS unlogged tables
> with fsync=off is not too realistic.  What kind of regression would we
> observe if we eliminated those last two variables?
>

Hard to say. I didn't find any regression on the machines available to me,
even with the original test case that I used, which was a pretty bad case
to start with (sure, Mithun tweaked it further to create an even worse
scenario). Maybe on the kind of machines he has access to, it might show
up even with those changes.


>
>
> I think that whether the code ends up getting contorted is an
> important consideration here.  For example, if the first of the things
> you mention can be done without making the code ugly, then I think
> that would be worth doing; it's likely to help fairly often in
> real-world cases.  The problem with making the code contorted and
> ugly, as you say that the second idea would require, is that it can
> easily mask bugs.


Agree. That's probably one reason why Alvaro wrote the patch to start with.
I'll give the first of those two options a try.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-03-21 Thread Pavan Deolasee
On Tue, Mar 21, 2017 at 10:47 PM, Bruce Momjian  wrote:

> On Tue, Mar 21, 2017 at 01:04:14PM -0400, Robert Haas wrote:
> > > I know we have talked about it, but not recently, and if everyone else
> > > is fine with it, I am too, but I have to ask these questions.
> >
> > I think that's a good question.  I previously expressed similar
> > concerns.  On the one hand, it's hard to ignore the fact that, in the
> > cases where this wins, it already buys us a lot of performance
> > improvement.  On the other hand, as you say (and as I said), it eats
> > up a lot of bits, and that limits what we can do in the future.  On
> > the one hand, there is a saying that a bird in the hand is worth two
> > in the bush.  On the other hand, there is also a saying that one
> > should not paint oneself into the corner.
> >
> > I'm not sure we've had any really substantive discussion of these
> > issues.  Pavan's response to my previous comments was basically "well,
> > I think it's worth it", which is entirely reasonable, because he
> > presumably wouldn't have written the patch that way if he thought it
> > sucked.  But it might not be the only opinion.
>
> Early in the discussion we talked about allowing multiple changes per
> WARM chain if they all changed the same index and were in the same
> direction so there were no duplicates, but it was complicated.  There
> was also discussion about checking the index during INSERT/UPDATE to see
> if there was a duplicate.  However, those ideas never led to further
> discussion.
>

Well, once I started thinking about how to do vacuum etc., I realised that
any mechanism which allows unlimited (or even a handful of) updates per
chain is going to be very complex and error-prone. But if someone has
ideas to do that, I am open. I must say though, it will make an already
complex problem even more complex.


>
> I know the current patch yields good results, but only on a narrow test
> case,


Hmm. I am kinda surprised you say that, because I never thought it was a
narrow test case we are targeting here. But maybe I'm wrong.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-03-21 Thread Pavan Deolasee
On Tue, Mar 21, 2017 at 10:34 PM, Robert Haas  wrote:

> On Tue, Mar 21, 2017 at 12:49 PM, Bruce Momjian  wrote:
> > On Tue, Mar 21, 2017 at 09:25:49AM -0400, Robert Haas wrote:
> >> On Tue, Mar 21, 2017 at 8:41 AM, Pavan Deolasee
> >> > TBH I see many artificial scenarios here. It will be very useful if
> he can
> >> > rerun the query with some of these restrictions lifted. I'm all for
> >> > addressing whatever we can, but I am not sure if this test
> demonstrates a
> >> > real world usage.
> >>
> >> That's a very fair point, but if these patches - or some of them - are
> >> going to get committed then these things need to get discussed.  Let's
> >> not just have nothing-nothing-nothing giant unagreed code drop.
> >
> > First, let me say I love this feature for PG 10, along with
> > multi-variate statistics.
> >
> > However, not to be a bummer on this, but the persistent question I have
> > is whether we are locking ourselves into a feature that can only do
> > _one_ index-change per WARM chain before a lazy vacuum is required.  Are
> > we ever going to figure out how to do more changes per WARM chain in the
> > future, and is our use of so many bits for this feature going to
> > restrict our ability to do that in the future.
> >
> > I know we have talked about it, but not recently, and if everyone else
> > is fine with it, I am too, but I have to ask these questions.
>
> I think that's a good question.  I previously expressed similar
> concerns.  On the one hand, it's hard to ignore the fact that, in the
> cases where this wins, it already buys us a lot of performance
> improvement.  On the other hand, as you say (and as I said), it eats
> up a lot of bits, and that limits what we can do in the future.


I think we can save a few bits, at some additional cost and/or complexity.
It all depends on what matters to us more. For example, we can choose not
to use the HEAP_LATEST_TUPLE bit and instead always find the root tuple
the hard way. Since only WARM would ever need to find that information,
maybe that's ok, since WARM's other advantages will justify it. Or we
could cache the information computed during an earlier heap_page_prune()
call and use that (just guessing that we can make it work, no concrete
idea at this moment).

We can also save the HEAP_WARM_UPDATED flag, since it is required only for
the abort-handling case. We can find a way to push that information down
to the old tuple if the UPDATE aborts and we detect the broken chain.
Again, not fully thought through, but doable. Of course, we will have to
carefully evaluate all code paths and make sure that we never lose that
information.

If the consumption of bits becomes a deal breaker, then I would first
trade away the HEAP_LATEST_TUPLE bit and then HEAP_WARM_UPDATED, purely
from a correctness perspective.

Thanks,
Pavan
-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-03-21 Thread Pavan Deolasee
On Wed, Mar 22, 2017 at 3:51 AM, Mithun Cy 
wrote:

> On Tue, Mar 21, 2017 at 8:10 PM, Robert Haas 
> wrote:
> > If the WAL writing hides the loss, then I agree that's not a big
> > concern.  But if the loss is still visible even when WAL is written,
> > then I'm not so sure.
>
> The test table schema was taken from the earlier tests that Pavan posted
> [1], hence it is UNLOGGED; all I tried was to stress the tests. Instead
> of updating 1 row at a time through pgbench (for which neither I nor
> Pavan saw any regression), I tried to update all the rows in a single
> statement.
>
>

Sorry, I did not mean to suggest that you set it up wrongly, I was just
trying to point out that the test case itself may not be very practical.
But given your recent numbers, the regression is clearly non-trivial and
something we must address.


> I have changed the settings as recommended and did a quick test as above
> in our machine by removing UNLOGGED world in create table statement.
>
> Patch Tested : Only 0001_interesting_attrs_v18.patch in [2]
>
> Response time recorded shows there is a much higher increase in response
> time from 10% to 25% after the patch.
>
>
Thanks for repeating the tests. They are very useful. It might make sense
to reverse the order or do 6 tests each and alternate between patched and
unpatched master just to get rid of any other anomaly.

BTW may I request another test with the attached patch? In this patch, we
check PageIsFull() even before deciding which attributes to check for
modification. If the page is already full, there is hardly any chance of
doing a HOT update (there could be a corner case where the new tuple is
smaller than the tuple used in the previous UPDATE and we have just enough
space to do a HOT update this time, but I think that's too narrow a case).

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


0001_interesting_attrs_v19.patch
Description: Binary data



Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-03-22 Thread Pavan Deolasee
On Wed, Mar 22, 2017 at 8:43 AM, Pavan Deolasee 
wrote:

>
>
> BTW may I request another test with the attached patch? In this patch, we
> check if the PageIsFull() even before deciding which attributes to check
> for modification. If the page is already full, there is hardly any chance
> of doing a HOT update  (there could be a corner case where the new tuple is
> smaller than the tuple used in previous UPDATE and we have just enough
> space to do HOT update this time, but I can think that's too narrow).
>
>
I would also request you to do a slightly different test where, instead of
updating the second column, we update the last column of the index, i.e.
col9. I would really appreciate it if you share results with both master
and the v19 patch.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-03-22 Thread Pavan Deolasee
On Wed, Mar 22, 2017 at 3:51 AM, Mithun Cy 
wrote:

>
> CREATE INDEX testindx ON testtab (col1, col2, col3, col4, col5, col6,
> col7, col8, col9);
> Performance measurement tests: ran 12 times to eliminate run-to-run
> latencies.
> ==
> VACUUM FULL;
> BEGIN;
> UPDATE testtab SET col2 = md5(random()::text);
> ROLLBACK;
>
> Response time recorded shows there is a much higher increase in response
> time from 10% to 25% after the patch.
>
>
After doing some tests on my side, I now think that there is something
else going on, unrelated to the patch. I ran the same benchmark on an AWS
i2.xlarge machine with 32GB RAM, shared_buffers set to 16GB, max_wal_size
to 256GB, checkpoint_timeout to 60min, and autovacuum off.

I compared master and v19, each time running 6 runs of the test. The
database was restarted whenever changing binaries, the tables were
dropped/recreated, and a checkpoint was taken after each restart (but not
between 2 runs, which I believe is what you did too.. but correct me if
that's a wrong assumption).

Instead of col2, I am updating col9, but that's probably not very
relevant.

VACUUM FULL;
BEGIN;
UPDATE testtab SET col9 = md5(random()::text);
ROLLBACK;


First set of 6 runs with master:
163629.8
181183.8
194788.1
194606.1
194589.9
196002.6

(database restart, table drop/create, checkpoint)
First set of 6 runs with v19:
190566.55
228274.489
238110.202
239304.681
258748.189
284882.4

(database restart, table drop/create, checkpoint)
Second set of 6 runs with master:
232267.5
298259.6
312315.1
341817.3
360729.2
385210.7

This looks quite weird to me. Obviously these numbers are completely
non-comparable; even the time for VACUUM FULL goes up with every run.

Maybe we can blame it entirely on the AWS instance, but the pattern in your
tests looks very similar, where the numbers slowly and steadily keep going
up. If you do a complete retest but run v18/v19 first and then master,
maybe we'll see the completely opposite picture?

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-03-22 Thread Pavan Deolasee
On Wed, Mar 22, 2017 at 4:53 PM, Mithun Cy 
wrote:

> On Wed, Mar 22, 2017 at 3:44 PM, Pavan Deolasee
>  wrote:
> >
> > This looks quite weird to me. Obviously these numbers are completely
> > non-comparable. Even the time for VACUUM FULL goes up with every run.
> >
> > May be we can blame it on AWS instance completely, but the pattern in
> your
> > tests looks very similar where the number slowly and steadily keeps going
> > up. If you do complete retest but run v18/v19 first and then run master,
> may
> > be we'll see a complete opposite picture?
> >
>
> For those tests I have done tests in the order --- patch18, then Master;
> both the time numbers were the same.


Hmm, interesting.


> One different thing
> I did was I was deleting the data directory between tests and creating
> the database from scratch. Unfortunately the machine I tested this is
> not available. I will test same with v19 once I get the machine and
> report you back.


Ok, no problem. I did some tests on an AWS i2.xlarge instance (4 vCPU, 30GB
RAM, attached SSD) and the results are shown below. But I think it is
important to get independent validation from your side too, just to ensure
I am not making any mistake in measurement. I've attached the naively
put-together scripts which I used to run the benchmark. If you find them
useful, please adjust the paths and run them on your machine.

I reverted to an UNLOGGED table because with WAL the results looked very
weird (as posted earlier), even when I was taking a CHECKPOINT before each
set and had set max_wal_size and checkpoint_timeout high enough to avoid
any checkpoint during the run. Anyway, that's a matter for separate
investigation and not related to this patch.

I did two kinds of tests.
a) update last column of the index
b) update second column of the index

v19 does considerably better than even master for the last-column-update
case and is pretty much in line with master for the second-column-update
test. The reason is very clear: v19 determines early in the cycle that the
buffer is already full and there is very little chance of doing a HOT
update on the page. In that case, it does not check any columns for
modification. Master, on the other hand, will scan through all 9 columns
(for the last-column-update case) and incur the same kind of wasteful
overhead.
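That early bail-out can be sketched as a toy model (Python, all names
illustrative; this is just the shape of the decision, not the actual
heap_update code):

```python
# Toy model of the v19 short-circuit: if the page has no room for a
# HOT/WARM update anyway, skip the per-column modified-attribute checks
# entirely. Otherwise, columns are compared until a modified indexed
# column is found (worst case: the last one).

def columns_checked(page_is_full, indexed_columns, modified_column):
    if page_is_full:
        return 0                      # v19-style: bail out before any checks
    checks = 0
    for col in indexed_columns:
        checks += 1
        if col == modified_column:    # stop once a modified column is found
            break
    return checks

cols = [f"col{i}" for i in range(1, 10)]
print(columns_checked(False, cols, "col9"))  # worst case: all 9 columns checked
print(columns_checked(True, cols, "col9"))   # full page: 0 checks
```

This is only the decision shape; the real patch folds the check into the
existing modified-attributes logic.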

The first, second and fourth columns show response time in ms, and the
third and fifth columns show the percentage difference over master. (I hope
the table looks fine; I'm trying a text-table generator tool :-). Apologies
if it looks messed up.)



+-----------+------------+--------+------------+--------+
|                 Second column update                   |
+-----------+------------+--------+------------+--------+
|  Master   |    v18     |  vs M  |    v19     |  vs M  |
+-----------+------------+--------+------------+--------+
| 96657.681 | 108122.868 | 11.86% | 96873.642  | 0.22%  |
| 98546.35  | 110021.27  | 11.64% | 97569.187  | -0.99% |
| 99297.231 | 110550.054 | 11.33% | 100241.261 | 0.95%  |
| 97196.556 | 110235.808 | 13.42% | 97216.207  | 0.02%  |
| 99072.432 | 110477.327 | 11.51% | 97950.687  | -1.13% |
| 96730.056 | 109217.939 | 12.91% | 96929.617  | 0.21%  |
+-----------+------------+--------+------------+--------+


+------------+------------+-------+------------+--------+
|                   Last column update                   |
+------------+------------+-------+------------+--------+
|   Master   |    v18     | vs M  |    v19     |  vs M  |
+------------+------------+-------+------------+--------+
| 112545.537 | 116563.608 | 3.57% | 103067.276 | -8.42% |
| 110169.224 | 115753.991 | 5.07% | 104411.44  | -5.23% |
| 112280.874 | 116437.11  | 3.70% | 104868.98  | -6.60% |
| 113171.556 | 116813.262 | 3.22% | 103907.012 | -8.19% |
| 110721.881 | 117442.709 | 6.07% | 104124.131 | -5.96% |
| 112138.601 | 115834.549 | 3.30% | 104858.624 | -6.49% |
+------------+------------+-------+------------+--------+


Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


interest_attrs_tests.tar.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-03-23 Thread Pavan Deolasee
Thanks Amit. v19 addresses some of the comments below.

On Thu, Mar 23, 2017 at 10:28 AM, Amit Kapila 
wrote:

> On Wed, Mar 22, 2017 at 4:06 PM, Amit Kapila 
> wrote:
> > On Tue, Mar 21, 2017 at 6:47 PM, Pavan Deolasee
> >  wrote:
> >>
> >>>
> >>
> >> Please find attached rebased patches.
> >>
> >
> > Few comments on 0005_warm_updates_v18.patch:
> >
>
> Few more comments on 0005_warm_updates_v18.patch:
> 1.
> @@ -234,6 +241,25 @@ index_beginscan(Relation heapRelation,
>   scan->heapRelation = heapRelation;
>   scan->xs_snapshot = snapshot;
>
> + /*
> + * If the index supports recheck, make sure that index tuple is saved
> + * during index scans. Also build and cache IndexInfo which is used by
> + * amrecheck routine.
> + *
> + * XXX Ideally, we should look at all indexes on the table and check if
> + * WARM is at all supported on the base table. If WARM is not supported
> + * then we don't need to do any recheck. RelationGetIndexAttrBitmap() does
> + * do that and sets rd_supportswarm after looking at all indexes. But we
> + * don't know if the function was called earlier in the session when we're
> + * here. We can't call it now because there exists a risk of causing
> + * deadlock.
> + */
> + if (indexRelation->rd_amroutine->amrecheck)
> + {
> + scan->xs_want_itup = true;
> + scan->indexInfo = BuildIndexInfo(indexRelation);
> + }
> +
>
> Don't we need to do this rechecking during parallel scans?  Also what
> about bitmap heap scans?
>
>
Yes, we need to handle parallel scans. Bitmap scans are not a problem
because they can never return the same TID twice. I fixed this, though, by
moving the check inside index_beginscan_internal.


> 2.
> +++ b/src/backend/access/nbtree/nbtinsert.c
> -
>  typedef struct
>
> Above change is not required.
>
>
Sure. Fixed.


> 3.
> +void _bt_clear_items(Page page, OffsetNumber *clearitemnos, uint16 nclearitems)
> +void _hash_clear_items(Page page, OffsetNumber *clearitemnos,
> +   uint16 nclearitems)
>
> Both the above functions look exactly same, isn't it better to have a
> single function like page_clear_items?  If you want separation for
> different index types, then we can have one common function which can
> be called from different index types.
>
>
Yes, makes sense. Moved that to bufpage.c. The reason I originally had
index-specific versions is that I started by putting the WARM flag in the
IndexTuple header. But since hash indexes do not have that bit free, I
moved everything to the TID bit-flag. I still left the index-specific
wrappers, but they just call PageIndexClearWarmTuples.


> 4.
> - if (callback(htup, callback_state))
> + flags = ItemPointerGetFlags(&itup->t_tid);
> + is_warm = ((flags & BTREE_INDEX_WARM_POINTER) != 0);
> +
> + if (is_warm)
> + stats->num_warm_pointers++;
> + else
> + stats->num_clear_pointers++;
> +
> + result = callback(htup, is_warm, callback_state);
> + if (result == IBDCR_DELETE)
> + {
> + if (is_warm)
> + stats->warm_pointers_removed++;
> + else
> + stats->clear_pointers_removed++;
>
> The patch looks to be inconsistent in collecting stats for btree and
> hash.  I don't see above stats are getting updated in hash index code.
>
>
Fixed. The hashbucketcleanup signature is just getting a bit too long.
Maybe we should move some of these counters into a structure and pass that
around; not done here though.


> 5.
> +btrecheck(Relation indexRel, IndexInfo *indexInfo, IndexTuple indexTuple,
> + Relation heapRel, HeapTuple heapTuple)
> {
> ..
> + if (!datumIsEqual(values[i - 1], indxvalue, att->attbyval,
> + att->attlen))
> ..
> }
>
> Will this work if the index is using non-default collation?
>
>
Not sure I understand that. Can you please elaborate?


> 6.
> +++ b/src/backend/access/nbtree/nbtxlog.c
> @@ -390,83 +390,9 @@ btree_xlog_vacuum(XLogReaderState *record)
> -#ifdef UNUSED
>   xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
>
>   /*
> - * This section of code is thought to be no longer needed, after analysis
> - * of the calling paths. It is retained to allow the code to be reinstated
> - * if a flaw is revealed in that thinking.
> - *
> ..
>
> Why does this patch need to remove the above code under #ifdef UNUSED?
>
>
Yeah, it isn't strictly necessary. But that dead code was getting in the
way, and hence I decided to strip it out. I can put it back if it's an
issue, or remove it as a separate commit first.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-03-23 Thread Pavan Deolasee
On Thu, Mar 23, 2017 at 3:02 PM, Amit Kapila 
wrote:

>
>
> That sounds like you are dodging the actual problem.  I mean you can
> put that same PageIsFull() check in master code as well and then you
> will most probably again see the same regression.


Well, I don't see it that way. There was a specific concern about a
specific workload that WARM might regress, and I think this change
addresses that. Sure, if you pick that one piece, put it in master first,
and then compare against the rest of the WARM code, you will see a
regression. But I thought what we were worried about was WARM causing a
regression for some existing user, who might see her workload running 10%
slower, and this change mitigates that.


> Also, I think if we
> test at fillfactor 80 or 75 (which is not unrealistic considering an
> update-intensive workload), then we might again see regression.


Yeah, we might, but it will be smaller than before, maybe 2% instead of
10%. And by doing this we are further narrowing an already narrow test
case. I think we need to see things in totality and weigh the cost-benefit
trade-offs. There are numbers for very common workloads where WARM may
provide 20%, 30%, or even more than 100% improvement.

Andres and Alvaro already have other ideas to address this problem even
further. And as I said, we can pass in index-specific information and make
that routine bail out even earlier. We need to accept that WARM will need
to do more attribute checks than master, especially when there is more than
one index on the table, and sometimes those checks will go to waste. I am
ok with providing a table-specific knob to disable WARM, but I'm not sure
if others would like that idea.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-03-23 Thread Pavan Deolasee
On Thu, Mar 23, 2017 at 4:08 PM, Pavan Deolasee 
wrote:

>
>
> On Thu, Mar 23, 2017 at 3:02 PM, Amit Kapila 
> wrote:
>
>>
>>
>> That sounds like you are dodging the actual problem.  I mean you can
>> put that same PageIsFull() check in master code as well and then you
>> will most probably again see the same regression.
>
>
> Well I don't see it that way. There was a specific concern about a
> specific workload that WARM might regress. I think this change addresses
> that. Sure if you pick that one piece, put it in master first and then
> compare against rest of the WARM code, you will see a regression.
>

BTW the PageIsFull() check may not help as much in master as it does with
WARM; in master we anyway bail out early, after a couple of column checks.
It may help reduce the 10% drop that we see while updating the last index
column, but if we compare master and WARM with the patch applied, the
regression should be quite nominal.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-03-23 Thread Pavan Deolasee
On Thu, Mar 23, 2017 at 11:44 PM, Mithun Cy 
wrote:

> Hi Pavan,
> On Thu, Mar 23, 2017 at 12:19 AM, Pavan Deolasee
>  wrote:
> > Ok, no problem. I did some tests on AWS i2.xlarge instance (4 vCPU, 30GB
> > RAM, attached SSD) and results are shown below. But I think it is
> important
> > to get independent validation from your side too, just to ensure I am not
> > making any mistake in measurement. I've attached naively put together
> > scripts which I used to run the benchmark. If you find them useful,
> please
> > adjust the paths and run on your machine.
>
> I did a similar test. Your v19 looks fine to me; it does not
> cause any regression. On the other hand, I also ran tests reducing the
> table fillfactor to 80, where I can see a small regression of 2-3% on
> average when updating col2; when updating col9, again, I do not see any
> regression.
>
>
Thanks Mithun for repeating the tests and confirming that the v19 patch
looks ok.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-03-23 Thread Pavan Deolasee
On Thu, Mar 23, 2017 at 7:53 PM, Amit Kapila 
wrote:

> On Thu, Mar 23, 2017 at 3:44 PM, Pavan Deolasee
>
> >
> > Yes, this is a very fair point. The way I proposed to address this
> upthread
> > is by introducing a set of threshold/scale GUCs specific to WARM. So
> users
> > can control when to invoke WARM cleanup. Only if the WARM cleanup is
> > required, we do 2 index scans. Otherwise vacuum will work the way it
> works
> > today without any additional overhead.
> >
>
> I am not sure on what basis user can set such parameters, it will be
> quite difficult to tune those parameters.  I think the point is
> whatever threshold we keep, once that is crossed, it will perform two
> scans for all the indexes.


Well, that applies to the vacuum parameters too, no? The general sense I've
gotten here is that we're ok to push some work into the background if it
helps real-time queries, and I kind of agree with that. If WARM improves
things in a significant manner even with this additional maintenance work,
it's still worth doing.

Having said that, I see many ways we can improve on this later. For
example, we can track information elsewhere about tuples which may have
received WARM updates (I think it will need to be a per-index bitmap or so)
and use that to do WARM chain conversion in a single index pass. But this
is clearly not PG 10 material.


>   IIUC, this conversion of WARM chains is
> required so that future updates can be WARM or is there any other
> reason?  I see this as a big penalty for future updates.
>

It's also necessary for index-only scans. But I don't see this as a big
penalty for future updates, because if there are indeed significant WARM
updates, then not preparing for future updates will result in write
amplification -- the very thing we are trying to solve here, and something
for which the patch seems to be showing good gains.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-03-24 Thread Pavan Deolasee
On Fri, Mar 24, 2017 at 4:04 PM, Amit Kapila 
wrote:

> On Fri, Mar 24, 2017 at 12:25 AM, Pavan Deolasee
>  wrote:
> >
> >
> > On Thu, Mar 23, 2017 at 7:53 PM, Amit Kapila 
> >
> > The general sense I've got
> > here is that we're ok to push some work in background if it helps the
> > real-time queries, and I kinda agree with that.
> >
>
> I don't think we can define this work as "some" work, it can be a lot
> of work depending on the number of indexes.  Also, I think for some
> cases it will generate maintenance work without generating benefit.
> For example, when there is one index on a table and there are updates
> for that index column.
>
>
That's a fair point, but I think we can address it. At the end of the first
index scan we would know how many WARM pointers the index has and whether
it's worth doing a second scan. For the case you mentioned, we would do a
second scan just on that one index, skip all the other indexes, and still
achieve the same result. On the other hand, if one index receives many
updates and the other indexes are rarely updated, then we might leave a few
WARM chains behind and won't be able to do IOS on those pages. But given
the premise that the other indexes receive rare updates, that may not be a
problem. Note: the code is not currently written that way, but it should be
a fairly small change.
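That per-index decision can be sketched as a toy model (Python, names
hypothetical; as noted above, the actual patch is not yet written this
way):

```python
# After the first index scan, vacuum knows how many WARM pointers each
# index contains. Run the (expensive) second scan only on indexes that
# actually have WARM pointers to convert; skip the rest.

def plan_second_scans(warm_pointer_counts):
    """warm_pointer_counts: {index_name: WARM pointers seen in the first
    scan}. Returns the indexes for which a second scan is worthwhile."""
    return [name for name, count in warm_pointer_counts.items()
            if count > 0]

counts = {"idx_col1": 12000, "idx_col2": 0, "idx_col9": 0}
print(plan_second_scans(counts))  # only idx_col1 pays for a second scan
```

A real implementation would presumably also weigh the count against a
threshold GUC rather than any nonzero value, as discussed upthread.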

The other thing that we didn't talk about is that vacuum will need to track
dead tuples and WARM candidate chains separately, which increases memory
overhead. So for very large tables, with the same amount of
maintenance_work_mem, one round of vacuum will be able to clean fewer
pages. We could work out a more compact representation, but that is not
done currently.


>
> > But this is clearly not
> > PG 10 material.
> >
>
> I don't see much discussion about this aspect of the patch, so not
> sure if it is acceptable to increase the cost of vacuum.  Now, I don't
> know if your idea of GUC's make it such that the additional cost will
> occur seldom and this additional pass has a minimal impact which will
> make it acceptable.


Yeah, I agree. I'm trying to schedule some more benchmarks, but any help is
appreciated.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-03-24 Thread Pavan Deolasee
On Fri, Mar 24, 2017 at 6:46 PM, Amit Kapila 
wrote:

>
>
> I was worried for the case if the index is created non-default
> collation, will the datumIsEqual() suffice the need.  Now again
> thinking about it, I think it will because in the index tuple we are
> storing the value as in heap tuple.  However today it occurred to me
> how will this work for toasted index values (index value >
> TOAST_INDEX_TARGET).  It is mentioned on top of datumIsEqual() that it
> probably won't work for toasted values.  Have you considered that
> point?
>
>
No, I haven't, and thanks for bringing that up. Now that I think more about
it and look at the code, I believe the naive approach of simply comparing
index attribute values against heap values is probably wrong.
TOAST_INDEX_TARGET is one such case, but I wonder if there are other
varlena attributes that we might store differently in the heap and the
index; for example, index_form_tuple() -> heap_fill_tuple() seem to do some
churning for varlena attributes. It's not clear to me whether
index_get_attr will return values that are binary comparable to heap
values. I wonder if calling index_form_tuple on the heap values, fetching
attributes via index_get_attr on both index tuples, and then doing a binary
compare is a more robust idea. Or maybe that's just duplicating effort.
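The form-then-compare idea can be illustrated with a toy model. Here
`index_form` is a hypothetical stand-in for the massaging that
index_form_tuple() does (e.g. compressing values longer than
TOAST_INDEX_TARGET); the threshold and helper are illustrative only:

```python
import zlib

TOAST_INDEX_TARGET = 16  # toy threshold; the real one is much larger

def index_form(value: bytes) -> bytes:
    # Mimic index tuple formation: long varlena values are stored in a
    # transformed (here: compressed) shape, so the stored bytes differ
    # from the raw heap bytes.
    return zlib.compress(value) if len(value) > TOAST_INDEX_TARGET else value

heap_value = b"x" * 100                # value as present in the heap tuple
index_value = index_form(heap_value)   # value as stored in the index tuple

# Naive recheck: compare heap bytes against index bytes directly -- wrong.
naive_equal = (heap_value == index_value)

# Robust recheck: run the heap value through the same forming step first,
# then do the binary comparison.
robust_equal = (index_form(heap_value) == index_value)

print(naive_equal, robust_equal)  # False True
```

The point is only that both sides must pass through the same
transformation before a binary compare is meaningful.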

While looking at this problem, it occurred to me that the assumptions made
for hash indexes are also wrong :-( Hash indexes have the same problem that
expression indexes have: a change in the heap value may not necessarily
cause a change in the hash key. If we don't detect that, we will end up
with two identical hash keys carrying the same TID pointer. This will cause
the duplicate key scans problem, since hashrecheck will return true for
both hash entries. That's a bummer as far as supporting WARM for hash
indexes is concerned, unless we find a way to avoid duplicate index
entries.
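The hash-index hazard can be sketched with a toy model (all names
hypothetical): two distinct heap values that produce the same hash key
leave two identical (key, TID) index entries, and a recheck that only
recomputes the hash passes for both:

```python
def toy_hash(value: str) -> int:
    # Deliberately collision-prone stand-in for the hash AM's hash function.
    return len(value) % 8

tid = (1, 3)  # both chain members hang off the same root line pointer

# A WARM update changes the heap value, but the hash key stays the same:
old_entry = (toy_hash("apple"), tid)
new_entry = (toy_hash("mango"), tid)

# The index now holds two identical (key, TID) entries...
print(old_entry == new_entry)  # True

# ...and a recheck that only recomputes the hash accepts both entries,
# so a scan would return the same heap tuple twice.
def hashrecheck(index_key, heap_value):
    return index_key == toy_hash(heap_value)

print(hashrecheck(old_entry[0], "mango"),
      hashrecheck(new_entry[0], "mango"))  # True True
```

With btree the recheck compares the (formed) key bytes, which do differ
here; the hash key alone carries too little information to disambiguate.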

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-03-25 Thread Pavan Deolasee
On Sat, 25 Mar 2017 at 11:03 PM, Peter Geoghegan  wrote:

> On Sat, Mar 25, 2017 at 12:54 AM, Amit Kapila 
> wrote:
> > I am not sure how do you want to binary compare two datums, if you are
> > thinking datumIsEqual(), that won't work.  I think you need to use
> > datatype specific compare function something like what we do in
> > _bt_compare().
>
> How will that interact with types like numeric, that have display
> scale or similar?
>
I wonder why Amit thinks datumIsEqual won't work once we convert the heap
values to an index tuple and then fetch them using index_get_attr. After
all, that's how the current index tuple was constructed when it was
inserted. In fact, we must not rely on _bt_compare, because that might
return a "false positive" even for two different heap binary values (I
think). To decide whether to do a WARM update or not, heap_update relies
only on binary comparison. Could it happen that for two different binary
heap values we still compute the same index attribute value, even when
expression indexes are not supported?

Thanks,
Pavan



> --
> Peter Geoghegan
>
-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


[HACKERS] Stale comments in vacuumlazy.c

2017-03-26 Thread Pavan Deolasee
I happened to notice a stale comment at the very beginning of vacuumlazy.c.
ISTM we forgot to fix it when we introduced the FSM; with the FSM, vacuum
no longer needs to track per-page free space info. I propose the attached
fix.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


vacuumlazy_comment_fixes.patch
Description: Binary data



Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-03-27 Thread Pavan Deolasee
On Tue, Mar 28, 2017 at 1:32 AM, Alvaro Herrera 
wrote:

> Is the WARM tap test suite supposed to work when applied without all the
> other patches?  I just tried applied that one and running "make check -C
> src/test/modules", and it seems to hang after giving "ok 5" for
> t/002_warm_stress.pl.  (I had to add a Makefile too, attached.)
>
>
These tests should run without WARM. I wonder, though, whether IPC::Run's
start/pump/finish facility is fully portable. Andrew, in an off-list
conversation, reminded me that there are no (or maybe just one) tests
currently using it in Postgres. I've run these tests on OSX and will try on
some Linux platform too.

Thanks,
Pavan
-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-03-27 Thread Pavan Deolasee
On Tue, Mar 28, 2017 at 1:59 AM, Robert Haas  wrote:

> On Thu, Mar 23, 2017 at 2:47 PM, Pavan Deolasee
>  wrote:
> > It's quite hard to say that until we see many more benchmarks. As author
> of
> > the patch, I might have got repetitive with my benchmarks. But I've seen
> > over 50% improvement in TPS even without chain conversion (6 indexes on
> a 12
> > column table test).
>
> This seems quite mystifying.  What can account for such a large
> performance difference in such a pessimal scenario?  It seems to me
> that without chain conversion, WARM can only apply to each row once
> and therefore no sustained performance improvement should be possible
> -- unless rows are regularly being moved to new blocks, in which case
> those updates would "reset" the ability to again perform an update.
> However, one would hope that most updates get done within a single
> block, so that the row-moves-to-new-block case wouldn't happen very
> often.
>
>
I think you're confusing update chains that stay within a block with
HOT/WARM chains. Even when the entire update chain stays within a block, it
can be made up of multiple HOT/WARM chains, and each of these chains offers
the ability to do one WARM update. So even without chain conversion, every
alternate update will be a WARM update, and the gains are perpetual.

Take a simplistic example of a table with just one tuple and four indexes,
where every update modifies just one of the indexed columns. Assuming no
WARM chain conversion, this is what would happen for every update:

1. WARM update, new entry in just one index
2. Regular update, new entries in all indexes
3. WARM update, new entry in just one index
4. Regular update, new entries in all indexes

At the end of N updates (assuming all fit in the same block), one index
will have N entries and the rest will have N/2 entries.

Compare that against master:
1. Regular update, new entries in all indexes
2. Regular update, new entries in all indexes
3. Regular update, new entries in all indexes
4. Regular update, new entries in all indexes


At the end of N updates (assuming all fit in the same block), all indexes
will have N entries. So with WARM we reduce bloat in 3 of the indexes, and
WARM works in an almost perpetual way even without chain conversion. If you
look at the graph I shared earlier (attached again), without WARM chain
conversion the rate of WARM updates settles down to 50%, which is not
surprising given what I explained above.

Thanks,
Pavan
-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-03-27 Thread Pavan Deolasee
On Tue, Mar 28, 2017 at 7:49 AM, Bruce Momjian  wrote:

> On Mon, Mar 27, 2017 at 04:29:56PM -0400, Robert Haas wrote:
> > On Thu, Mar 23, 2017 at 2:47 PM, Pavan Deolasee
> >  wrote:
> > > It's quite hard to say that until we see many more benchmarks. As
> author of
> > > the patch, I might have got repetitive with my benchmarks. But I've
> seen
> > > over 50% improvement in TPS even without chain conversion (6 indexes
> on a 12
> > > column table test).
> >
> > This seems quite mystifying.  What can account for such a large
> > performance difference in such a pessimal scenario?  It seems to me
> > that without chain conversion, WARM can only apply to each row once
> > and therefore no sustained performance improvement should be possible
> > -- unless rows are regularly being moved to new blocks, in which case
> > those updates would "reset" the ability to again perform an update.
> > However, one would hope that most updates get done within a single
> > block, so that the row-moves-to-new-block case wouldn't happen very
> > often.
> >
> > I'm perplexed.
>
> Yes, I asked the same question in this email:
>
> https://www.postgresql.org/message-id/2017032119.
> ge16...@momjian.us
>
>
And I've answered it so many times by now :-)

Just to add to what I said in the other email: HOT/WARM chains are created
when a new root line pointer is created in the heap (a line pointer that
has an index entry pointing to it), and a new root line pointer is created
when a non-HOT/non-WARM update is performed. So as soon as you do a
non-HOT/non-WARM update, the next update can again be a WARM update, even
when everything fits in a single block.

That's why, for a workload which doesn't do HOT updates and where not all
index keys are updated, you'll find every alternate update to a row to be a
WARM update, even when there is no chain conversion. That by itself can
save a lot of index bloat and reduce IO on the indexes and WAL volume.

Let me know if it's still not clear and I can draw some diagrams to explain
it.

Thanks,
Pavan
-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

