There might be a patch available for this already. In the worst case
articulated above (less than 64 columns), if all the nulls are trailing
nulls, the bitmap need not be saved. Actually it is not 64(actually 72), as
postgres heaptupleheader is only 23 bytes and one byte is left for the
start of
I feel sad, that i followed this topic very late. But i still want to put
forward my views.
Have we thought on the lines of how Robert has implemented relation level
locks. In short it should go like this
a) The locks for enforcing Referential integrity should be taken only when
the rarest of the
Insert, Update and Delete don't take locks they simply mark the tuples
they change with an xid. Anybody else wanting to wait on the lock
just waits on the xid. We do insert a lock row for each xid, but not
one per row changed.
I mean the foreign key checks here. They take a Select for Share
Please explain in detail your idea of how it will work.
OK. I will try to explain the abstract idea, i have.
a) Referential integrity gets violated, when there are referencing key
values, not present in the referenced key values. We are maintaining the
integrity by taking a Select for Share
No, I don't think it will all be in memory - but that's part of the
performance calculation. If you need to check on the status of an XID
and find that you need to read a page of data in from disk, that's
going to be many orders of magnitude slower than anything we do with s
snapshot now.
On Tue, Aug 23, 2011 at 5:25 AM, Robert Haas robertmh...@gmail.com wrote:
I've been giving this quite a bit more thought, and have decided to
abandon the scheme described above, at least for now. It has the
advantage of avoiding virtually all locking, but it's extremely
inefficient in its
There are extensive comments on this in visibilitymap.c and, in
heapam.c, heap_xlog_visible().
I went through the design again and again. I am convinced that this should
not have any functional bugs and should not cause much performance issues.
Nice thoughts on bypassing the WAL Logging.
The all_visible_cleared flag is included in the WAL record of the insert
(or update or delete). Partial page writes are not a problem, because we
always fetch the VM page and clear the bit, regardless of the LSN on the
VM page.
Two things
a) First, my understanding of checkpoint is that it
a) First, my understanding of checkpoint is that it flushes all the pages,
that got changed below a particular LSN. If we are not having a LSN in the
visibility map, how we will be sure, that a visibility map page is getting
flushed/not?
Please ignore this statement. I confused between the
The above could already happen in 8.4, where the visibility map was
introduced. The contention on the VM buffer would be just as bad whether you
hold the heap page lock at the same time or not. I have not heard any
complaints of contention on VM buffers.
--
Heikki Linnakangas
a) First
Hmm, you have a point. If 100 backends simultaneously write to 100
different pages, and all of those pages are all-visible, then it's
possible that they could end up fighting over the buffer content lock
on the visibility map page. But why would you expect that to matter?
In a heavily
On Sat, Aug 20, 2011 at 4:48 PM, Gokulakannan Somasundaram
gokul...@gmail.com wrote:
The above could already happen in 8.4, where the visibility map was
introduced. The contention on the VM buffer would be just as bad whether you
hold the heap page lock at the same time or not. I have
I think that you have switched gears here and are in this paragraph
talking about the 8.4-era visibility map changes rather than my recent
work. There is zero evidence that those changes caused any
performance problem. I've spent a large chunk of the last four months
looking for
On Sat, Aug 20, 2011 at 4:57 AM, Gokulakannan Somasundaram
gokul...@gmail.com wrote:
by your argument, if WALInserLock is held for 't' seconds, you should
definitely be holding visibility map lock for more than time frame 't'.
Nope, that's not how it works. Perhaps you should read
By your argument, we can say that no-one will create a index with a
function
like random(), time(), date(), broken operators etc. Its hard to imagine
a
index in which a a user will only want to insert and never select on it.
The whole point of this optimization is to make index access
Note that we already have the visibility map, and the accesses needed to
update it are already there. Granted, we'll have to change the logic
slightly to make it crash safe, but I don't expect that to add any
meaningful overhead - the changes are going to be where the bits are set,
ie.
Well, that would certainly be alarming if true, but I don't think it
is. As far as I can see, the overhead of making the visibility map
crash-safe is just (1) a very small percentage increase in the work
being done by VACUUM and (2) a slight possibility of extra work done
by a foreground
Well, that would certainly be alarming if true, but I don't think it
is. As far as I can see, the overhead of making the visibility map
crash-safe is just (1) a very small percentage increase in the work
being done by VACUUM and (2) a slight possibility of extra work done
by a foreground
On Sat, Aug 20, 2011 at 2:25 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
On 19.08.2011 21:06, Gokulakannan Somasundaram wrote:
If you are following the same design that Heikki put forward, then there
is
a problem with it in maintaining the bits in page and the bits
On Sat, Aug 20, 2011 at 2:51 AM, Gokulakannan Somasundaram
gokul...@gmail.com wrote:
On Sat, Aug 20, 2011 at 2:25 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
On 19.08.2011 21:06, Gokulakannan Somasundaram wrote:
If you are following the same design that Heikki put
However, for small operations it's a net loss - you avoid writing a WAL
record, but you have to fsync() the heap instead. If you only modify a few
rows, the extra fsync (or fsyncs if there are indexes too) is more expensive
than writing the WAL.
We'd need a heuristic to decide whether to
All operations that clear the bit area are already WAL-logged.
Is it the case with visibility map also?
Thanks.
robertmh...@gmail.com wrote:
On Wed, Mar 23, 2011 at 2:29 AM, Gokulakannan Somasundaram
gokul...@gmail.com wrote:
All operations that clear the bit area are already WAL-logged.
Is it the case with visibility map also?
Thanks.
Yes. Look at the comment that the patch removes
I took a crack at implementing the first approach described above,
which seems to be by far the simplest idea we've come up with to date.
Patch attached. It doesn't seem to be that complicated, which could
mean either that it's not that complicated or that I'm missing
something. Feel free
Similarly using the no. of select hits on a table we can check that if
maximum no. of times it is on a non-index field we can index on that field
to make select faster.
It's impractical to figure out where indexes should go at without
simulating what the optimizer would then do with
I don't think this should involve much code change. But no-one
interested
On Sat, Mar 27, 2010 at 2:23 AM, Gokulakannan Somasundaram
gokul...@gmail.com wrote:
Hi,
Since we insert a new entry into the index for every update that's being
made into the table, we inevitably make a unique
Please write it, then test the performance and publish your results,
with a detailed analysis of whether there is benefit and in which cases
there is a loss.
Thanks. Will do it. Just wanted to know, whether the idea will get
rejected right away / worth trying out.
Thanks,
Gokul.
Hi,
The function outDatum doesn't take care of varlena structures which are
stored externally. Is it the intention??
Thanks,
Gokul.
Hi,
Since we insert a new entry into the index for every update that's being
made into the table, we inevitably make a unique check against the older
version of the newly inserted row, even when the values are not updated. Of
course i am talking about non-HOT updates. (We will not go to the
Hi,
While i was studying the unique index checks very closely, i realized
that what we need is to find out whether the tuple is deleted / not. So say
a tuple is deleted by a transaction, but it is not dead( because of some
long running transaction ), still we can mark a hint bit as deleted and
How are you going to unmark the hint bit in case of a rollback?
Only after you find that the transaction is committed, this hint bit has to
be set. It is equivalent to any other hint bit.
Gokul.
it seems fairly unlikely to me that this would be useful enough to
justify using up a precious hint bit. The applicability of the hint
is very short-term --- as soon as the tuple is dead to all transactions,
it can be marked with the existing LP_DEAD hint bit. And if it's only
useful for
Hi,
With the implementation of deferred unique constraints, we need to go
back to the index second time to check whether the unique check is valid.
Say a situation occurs like this
a) the first session doing the unique check finds out that there is a unique
check required second time and just
Can you also explain how are we avoiding duplicates in this scenario?
a) Say there are three pages(P,Q, R) full of duplicate tuples, that are
deleted but not dead of id x(due to some long running transaction).
b) Now Session A gets in and checks the duplicate tuples for their
liveliness with the
Are you talking about exclusion constraints or btree uniqueness
constraints? This doesn't seem to be a particularly accurate
description of the implementation of either one. The way btree
deals with this is explained in _bt_doinsert:
Unique constraints
* NOTE: obviously,
No, you don't understand how it works. All would-be inserters will hit
the same target page to begin with, ie, the first one that the new key
could legally be inserted on. The lock that protects against this
problem is the lock on that page, regardless of which page the key
actually ends up
T2 : session 1 releases the lock on p1 (its waiting to acquire a ex lock
on
p2)
That's not what we do. See _bt_findinsertloc.
regards, tom lane
I am really confused. Please keep the cool and explain me, if i am
wrong. I could see this code in
Oh! yeah, i got it. Thanks
On Wed, Mar 24, 2010 at 12:26 AM, Gokulakannan Somasundaram
gokul...@gmail.com wrote:
T2 : session 1 releases the lock on p1 (its waiting to acquire a ex lock
on
p2)
That's not what we do. See _bt_findinsertloc.
regards, tom lane
Hi,
I was looking at the VarBit data structure and found out that instead of
storing the number of bits in four bytes, we can save the number of bits
that are valid in the last byte. Since we already store the number of bytes
in Varlena Header, we can calculate the number of valid bits by doing
This might be worth considering in a release cycle where we were going
to break on-disk data compatibility for some other reason. But I can
hardly imagine wanting to do it by itself. Marginal space savings for
the bit types just isn't that exciting.
Maybe we should start a special section
Surely the VM is already update-friendly. If you update a tuple in a
page with the visibility bit set, the bit must be unset or you will get
wrong results.
I was referring in the context of index only scans to skip visibility
checks. I doubt, whether the visibility map feature to skip
On Thu, Mar 18, 2010 at 2:50 AM, Tom Lane t...@sss.pgh.pa.us wrote:
Jeff Davis pg...@j-davis.com writes:
There are all kinds of challenges there, but it might be worth thinking
about. Visibility information is highly compressible, and requires
constant maintenance (updates, deletes,
Secondly there's the whole retail vacuum problem -- any
index entries referring to this page would be left dangling unless
there's some kind of retail vacuum or perhaps a page version number.
The issue, we can divide into two
a)volatile functions
b)broken datatypes
For a) I think volatile
I didn't mean that we'd want to compress it to the absolute minimum
size. I had envisioned that it would be a simple scheme designed only to
eliminate long runs of identical visibility information (perhaps only
the frozen and always visible regions would be compressed).
The extra level of
The visibility map itself is already an example of compression. If
visibility information were randomly distributed among tuples, the
visibility map would be nearly useless.
I believe it is very difficult to make visibility map update friendly
without compromising durability. But such a
When we were doing the ordered-aggregates patch, I considered passing
all those values as explicit parameters to transformAggregateCall,
and having it build the Aggref node from scratch and return it.
However having seven or eight parameters to transformAggregateCall
(and more in future if
Hi,
I noticed a problem with the source code of 9.0Alpha 4. In parse_agg.c,
there is a call made to transformSortClause.
00098 torder = transformSortClause
http://doxygen.postgresql.org/parse__clause_8c.html#53199c36a198b5acf15a26fbd7311f79(pstate,
00099
Hi,
I think, this should be the probable fix.
There is agg_order in ParseFuncOrColumn, which should get passed on to
transformAggregateCall and that should be placed in this call, instead of
agg-aggorder.
Thanks,
Gokul.
On Tue, Mar 16, 2010 at 5:19 PM, Gokulakannan Somasundaram
gokul
transformSortClause is passed the untransformed aggorder list, which is
in fact a list of SortBy nodes, and it returns the transformed list
(SortGroupClause nodes), which is stored back into the aggorder field
a bit further down.
There are a number of regression tests that would fail in
I think you have to take up a simpler project as a first project. This
is a major overhaul of transaction information and it depends on
understanding how a lot of different areas work -- all of which are
very complex tricky areas to understand.
Greg,
I just feel the fast full
a) We are already going from table to index to do unique checks. This is
the
same thing, which we will do to go and update the snapshot in the
indexes.
No, it is not the same thing. Updating index snapshots requires being
able to *re-find* a previously made index entry for the current
No, it is not the same thing. Updating index snapshots requires being
able to *re-find* a previously made index entry for the current row.
And it has to be done 100% reliably. The worst that happens if an index
entry is not found when it should be during a uniqueness check is that
the
The transaction information on tuples take 18 bytes plus several info
bits. It's possible just storing a subset of that would be useful but
it's unclear. And I think it would complicate the code if it had to
sometimes fetch the heap tuple to get the rest and sometimes doesn't.
Visibility
If i have got over excited in the previous update, please ignore that.
a) We are already going from table to index to do unique checks. This is the
same thing, which we will do to go and update the snapshot in the indexes.
b) The way, it should work would be to have a check on whether the
To be a bit more concrete: the typical sort of failure that you could
get from broken btree operators is failure of transitivity, that is
the comparators report A B and B C for some A, B, C, but do not say
that A C when those two values are compared directly. I don't see any
convenient
It definitely affects current indexes. We can't completely avoid bad user
functions. That is why it is important to put limits on how much damage they
can do. That's the motivation for the idea I mentioned before, of
double-checking visibility data in an IndexTuple before letting it survive a
Much better to take on a simple project like enabling full
sequential index scans which you claimed you had a solution for and
which is in any case an important sub-problem for IOT.
Greg,
Well i don't think i am ready to take up a project of this size.
But at the same time some
Missed the group..
On Sat, Feb 27, 2010 at 12:00 AM, Gokulakannan Somasundaram
gokul...@gmail.com wrote:
I definitely think thick indexes were too ambitious of a target for a
first time patch. Sequential index scans is very ambitious itself
despite being significantly simpler (if you have
IIRC, what was being talked about was shoehorning some hint bits into
the line pointers by assuming that size and offset are multiples of 4.
I'm not thrilled with having mutable status bits there for reliability
reasons, but it could be done without breaking a lot of existing code.
What I was
It does. The point is that the system is set up to limit the bad
consequences. You might (will) get wrong query answers, but the
heap data won't get corrupted.
Again Tom, if there is an update based on index scan, then it takes the
tupleid and updates the wrong heap data right?
The only
It does. The point is that the system is set up to limit the bad
consequences. You might (will) get wrong query answers, but the
heap data won't get corrupted.
Tom,
if this is our goal - *can return wrong query answers, but
should not corrupt the heap data.* and if we make Thick
No, what generally happens is it fails to find a matching index entry at
all, because the search algorithm concludes there can be no match based
on the limited set of comparisons it's done. Transitivity failures lead
to searching the wrong subset of the index.
Actually Tom, i am not able to
Wait a minute. Bingo So for unique checks we are already going to
index from Heap. So it is the same thing i am doing with Thick index. So if
we can trust our current unique checks, then we should trust the Thick
index.
Thanks Tom!!! for having this good conversation
I think this
Actually Tom, i am not able to understand that completely. But what you are
saying that in the current scenario, when there is a broken data type based
index, then it will return no results, but never will return wrong results.
So never the update will corrupt the heap data. But i take it as
I am just adding my two cents, please ignore it, if its totally irrelevant.
While we do performance testing/tuning of any applications, the important
things, a standard monitoring requirement from a database are
a) Different type of wait events and the time spent in each of them
b) Top ten Queries
Tom,
I just took the patch, but it seems to be in binary format. Can you send
me the patch to me?
Thanks,
Gokul.
On Sat, May 30, 2009 at 3:12 AM, Tom Lane t...@sss.pgh.pa.us wrote:
Josh Berkus j...@agliodbs.com writes:
Tom,
Is anyone interested enough to try it if I code it?
If
Yes. When a bit is cleared, that's OK, because a cleared bit just means
you need to check visibility in the heap tuple. When a bit is set,
however, it's important that it doesn't hit the disk before the
corresponding heap page update. That's why visibilitymap_set() sets the
LSN on the page.
The replay of the heap insert/update/delete record updates the
visibility map.
So are you planning to make that section, which writes the xlog and updates
the visibility map inside a PANIC section right?
The replay of the heap insert/update/delete record updates the
visibility map.
Say a checkpoint has occured in between and flushed the dirty pages into
disk, while the updater waits to update the visibility map. Now there will
be no replay for the insert/update/delete right?
1) transaction information in index
This seems like a lot of bloat in indexes. It also means breaking
a lot of other optimizations such as being able to read the tuples
directly from the heap page without locking. I'm not sure how much
those are worth though. But adding 24 bytes to every
Wait a second, which idea are we currently talking about? No heap at
all, or just the ability to check visibility without visiting the heap?
I was talking about the indexes with snapshot
If it's a genuine IOT (ie no separate heap), then you are not going to
be able to get away without a
I disagree with that, Gokul -- if the ordering operators are volatile or
just incorrect, during DELETE, you could set xmax in the wrong IndexTuple.
Then there will be another IndexTuple that says it's visible, but it points
to a non-visible heap tuple. I think you should follow the pointers
Tom,
Actually, if you need to squeeze a few more bits into that word, the
thing to do would be to get rid of storing the tuple length there.
This would involve adding the same type of indirection header that
we use for HeapTuples, so that the length would be available at need
without going
First of all, volatility is not the only issue. The ordering ops could also
be incorrect, e.g., violate the transitivity property. there is no reliable
way to determine if a function is volatile and/or incorrectly specified.
No it is the only issue. If you create a datatype with volatile
No, it's far from harmless. As soon as that heap TID gets filled with
an unrelated tuple, you run the risk of indexscans alighting on and
perhaps modifying the wrong tuple.
Tom,
In the Function based indexes on those functions, which we are
suspecting to be a volatile one Or in the
On Fri, Feb 26, 2010 at 9:54 AM, Gokulakannan Somasundaram
gokul...@gmail.com wrote:
No, it's far from harmless. As soon as that heap TID gets filled with
an unrelated tuple, you run the risk of indexscans alighting on and
perhaps modifying the wrong tuple.
Tom,
i think
No, we're not going there. That breaks the fundamental page content
manipulation algorithms, and falls down for tuples not yet stored in a
page (or being examined without a pointer to the page readily at hand),
and has no redeeming social value anyway compared to doing it in the
proven
Proving that a set of comparison operators are consistent just by
examining their runtime behavior is probably equivalent to solving the
halting problem. I can't see us doing it, or wanting to accept the
overhead of checking it even if it could be done.
The overhead of checking is very
I think Gokul was asking because he wanted to work on it, but wanted to
check community approval first.
Yes the problem is that we need to come to a consensus on broken data types.
As Heikki pointed out, those data types, which is based on a unstable
function like time, date, random etc.
The fragility there is not an issue in a mostly read-only application,
but it definitely would be a concern in other cases.
While we accept that visibility map is good for read only application, why
can't we make it optional? Atleast if there is a way for a person to drop
the visibility map
Forgot to include the group..
On Wed, Feb 24, 2010 at 5:38 PM, Gokulakannan Somasundaram
gokul...@gmail.com wrote:
I am not familiar with this term broken data types, and I just looked for
it in the source code and couldn't find it.
What exactly are you referring to?
cheers
andrew
If you have a scenario where the visibility map incurs a measurable
overhead, let's hear it. I didn't see any in the tests I performed, but
it's certainly possible that if the circumstances are just right it
makes a difference.
Heikki,
The obvious one, i could observe is that it
I think you're a barking up the wrong tree. AFAIUI, the need for the
visibility map has not very much to do with whether the table has
indices, and everything to do with avoiding unnecessary VACUUMs. In
any event, you've not shown that the visibility map HAS any overhead,
so talking about
But adding 24 bytes to every index entry seems
pretty unlikely to be a win anyways.
We actually wanted to make it optional. Not every index will be like that.
More than that we can make it into 16 bytes. Only commands within the same
transaction will not be able to do a index only scan.
With an IOT I don't understand how you get out of index corruption
without data loss. That's a showstopper for practical use, I think.
For simplicity, say we are storing all the non-leaf pages of the index in a
seperate file, then the leaf pages are nothing but the table. So if we can
On Wed, Feb 24, 2010 at 10:09 PM, Tom Lane t...@sss.pgh.pa.us wrote:
Kevin Grittner kevin.gritt...@wicourts.gov writes:
So you are essentially proposing that rather than moving the heap
data into the leaf tuples of the index in the index file, you will
move the leaf index data into the
Yes. The visibility map doesn't need any new WAL records to be written.
We probably will need to add some WAL logging to close the holes with
crash recovery, required for relying on it for index-only-scans, but
AFAICS only for VACUUM and probably only one WAL record for a whole
bunch of heap
Missed the group...
On Thu, Feb 25, 2010 at 12:34 AM, Gokulakannan Somasundaram
gokul...@gmail.com wrote:
On Thu, Feb 25, 2010 at 12:28 AM, Gokulakannan Somasundaram
gokul...@gmail.com wrote:
That doesn't work because when you split an index page any sequential
scan in progress
Missed the group
So basically you want to have index-only scans, but you want them to
be really slow?
No Robert, i just asked him to make it optional, so that the crash safe
visibility map is used only when its need is realized by the DB designer.
When there is a table with no indexes/
It might be slightly easier given the assumption that you only want to
scan leaf tuples.
But there's an additional problem I didn't think of before. Currently
we optimize index scans by copying all relevant tuples to local memory
so we don't need to hold an index lock for an extended time or
That doesn't work because when you split an index page any sequential
scan in progress will either see the same tuples twice or will miss
some tuples depending on where the new page is allocated. Vacuum has a
clever trick for solving this but it doesn't work for arbitrarily many
concurrent
You also need to avoid scanning the same tuple twice. That's not a
problem for VACUUM, but it is for full index scans.
Heikki,
Actually the logic, which i have explained earlier is to avoid
scanning tuples twice.
Gokul.
I haven't thought about whether this is sufficient but if it is then
an initial useful thing to add would be to use it for queries where we
have a qual that can be checked using the index key even though we
can't do a range scan. For example if you have a btree index on
a,b,c and you have a
The WAL record of the heap insert/update/delete contains a flag
indicating that the visibility map needs to be updated too. Thus no need
for a separate WAL record.
Heikki,
Have you considered these cases?
a) The current WAL architecture makes sure that the WAL Log is written
before
On Thu, Feb 25, 2010 at 3:16 AM, Gokulakannan Somasundaram
gokul...@gmail.com wrote:
I haven't thought about whether this is sufficient but if it is then
an initial useful thing to add would be to use it for queries where we
have a qual that can be checked using the index key even though we
On Tue, Feb 23, 2010 at 10:42 AM, Tom Lane t...@sss.pgh.pa.us wrote:
Takahiro Itagaki itagaki.takah...@oss.ntt.co.jp writes:
Instead, how about excluding columns in primary keys from table data?
How will you implement select * from mytable ? Or even
select * from mytable where
.
On Mon, Feb 22, 2010 at 12:21 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
Gokulakannan Somasundaram wrote:
Hi,
As you all know, Index Organized tables are a way by which we can
automatically cluster the data based on the primary key. While i was
thinking about
Forgot to include the group...
On Mon, Feb 22, 2010 at 4:31 PM, Gokulakannan Somasundaram
gokul...@gmail.com wrote:
These sound like the same point to me. I don't think we're concerned
with footprint -- only with how much of that footprint actually needs
to be scanned. So if we have
Hi,
As you all know, Index Organized tables are a way by which we can
automatically cluster the data based on the primary key. While i was
thinking about an implementation for postgres, it looks like an impossible
with the current ideologies. In an IOT, if a record gets updated, we need to
Hi,
Is there any reason why we have given lesser precedence for postfix
operator compared to multiplication/division? Usually postfix operators have
more precedence than the binary operations. Is this some kind of work around
to provide user-defined operators? Can someone help me understand
1 - 100 of 229 matches
Mail list logo