Re: [HACKERS] Re: patch submission: truncate trailing nulls from heap rows to reduce the size of the null bitmap

2012-04-29 Thread Gokulakannan Somasundaram
There might be a patch available for this already. In the worst case
articulated above (fewer than 64 columns), if all the nulls are trailing
nulls, the bitmap need not be saved at all. Actually the threshold is not 64
but 72, since the Postgres HeapTupleHeader is only 23 bytes and one padding
byte is left over at the start of the bitmap.

The same principle could be extended to index tuples as well.
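
As a rough illustration of the arithmetic involved (a standalone C sketch,
not the submitted patch; the helper names are invented for the example), the
following shows how truncating trailing nulls shrinks or removes the null
bitmap:

#include <stdbool.h>
#include <stdio.h>

/* number of attributes left once trailing nulls are truncated away */
static int
effective_natts(const bool *isnull, int natts)
{
    while (natts > 0 && isnull[natts - 1])
        natts--;
    return natts;
}

/* bytes of null bitmap that would actually have to be stored */
static int
null_bitmap_bytes(const bool *isnull, int natts)
{
    int  kept = effective_natts(isnull, natts);
    bool any_null = false;

    for (int i = 0; i < kept; i++)
        if (isnull[i])
            any_null = true;

    /* if no nulls remain among the kept attributes, no bitmap is needed */
    return any_null ? (kept + 7) / 8 : 0;
}

int
main(void)
{
    bool isnull[100] = {false};

    /* a 100-column row where only the first 5 columns are non-null */
    for (int i = 5; i < 100; i++)
        isnull[i] = true;

    printf("bitmap bytes without truncation: %d\n", (100 + 7) / 8);   /* 13 */
    printf("bitmap bytes with truncation:    %d\n",
           null_bitmap_bytes(isnull, 100));                           /* 0 */
    return 0;
}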

Thanks,
Gokul.


Re: [HACKERS] foreign key locks, 2nd attempt

2012-03-07 Thread Gokulakannan Somasundaram
I am sorry that I followed this topic very late, but I still want to put
forward my views.
Have we thought along the lines of how Robert implemented relation-level
locks? In short, it would go like this:

a) The locks for enforcing referential integrity should be taken only when
the rarest of the events that could cause an integrity failure occurs,
namely an update of the referenced column. Other updates, deletes, and
inserts should not be required to take locks. In this way, we can eliminate
a lot of lock traffic.

So suppose we have a table employee(empid, empname, ..., deptid references
dept(deptid)) and a table dept(deptid, deptname).

Currently we take shared locks on the referenced rows in the dept table
whenever we update something in the employee table. That should not be
necessary. Instead, any insert, update of the referencing column, or delete
should just check for such a lock in the PGPROC structures, and that lock
would only get created when the referenced deptid gets updated or deleted
(a rare event).

b) The operation of updating the referenced column, however, becomes more
costly. It could, for example, create something like a predicate lock (as
used for enforcing SERIALIZABLE) and register it in all the PGPROC
structures.

I know this is an abstract idea, but I just wanted to know whether we have
thought along those lines.

Thanks,
Gokul.


Re: [HACKERS] foreign key locks, 2nd attempt

2012-03-07 Thread Gokulakannan Somasundaram


 Insert, Update and Delete don't take locks they simply mark the tuples
 they change with an xid. Anybody else wanting to wait on the lock
 just waits on the xid. We do insert a lock row for each xid, but not
 one per row changed.

I mean the foreign key checks here. They take a SELECT FOR SHARE lock,
right? That's what we are trying to optimize here, or am I missing
something? By following the suggested methodology, the foreign key checks
wouldn't take any locks.


 It's worked that way for 5 years, so its too late to modify it now and
 this patch won't change that.

 The way we do RI locking is designed to prevent holding that in memory
 and then having the lock table overflow, which then either requires us
 to revert to the current design or upgrade to table level locks to
 save space in the lock table - which is a total disaster, if you've
 ever worked with DB2.

 What you're suggesting is that we store the locks in memory only as a
 way of avoiding updating the row.

 But that memory would be consumed only when someone updates the referenced
column (which will usually be the primary key of the referenced table). Any
database programmer knows that updating a primary key is not good for
performance, so we go by the same logic.


 No, updates of referenced columns are exactly the same as now when no
 RI checks happening.

 If the update occurs when an RI check takes place there is more work
 to do, but previously it would have just blocked and done nothing. So
 that path is relatively heavyweight but much better than nothing.

 As I have already said, that path is definitely heavyweight (much as Robert
has made the DDL path heavyweight). If we assume that DDLs are going to be
rare, then we can equally assume that updates of primary keys are rare in a
normal database.



 The most useful way to help with this patch right now is to run
 performance investigations and see if there are non-optimal cases. We
 can then see how the patch handles those. Theory is good, but it needs
 to drive experimentation, as I myself re-discover continually.

 I understand. I just wanted to know whether the developer had considered
that line of thought.

Thanks,
Gokul.


Re: [HACKERS] foreign key locks, 2nd attempt

2012-03-07 Thread Gokulakannan Somasundaram


 Please explain in detail your idea of how it will work.


OK. I will try to explain the abstract idea I have.
a) Referential integrity is violated when there are referencing key values
that are not present among the referenced key values. We currently maintain
the integrity by taking a SELECT FOR SHARE lock during the foreign key
checks, so that the referenced value is not updated or deleted during the
operation.

b) We can enforce the same thing in the reverse direction. When there is an
update or delete of the referenced value, we don't want any new inserts of
that value into the referencing table, nor any update that would change a
referencing row to the value being updated or deleted. So we take some kind
of lock that prevents that from happening. This can be achieved through
either
i) the predicate locking infrastructure that is already present, or
ii) a temporary B-tree index (with no WAL protection) that gets created only
when referenced values are updated, and that holds the values being updated
or deleted (if we are wary of predicate locking).

So whenever we do foreign key checks, we just need to make sure there is no
such referential integrity lock in our own PGPROC structure (if implemented
with predicate locking), or check the temporary B-tree index for an entry
matching the value we are about to insert or update to (an empty tree can be
tracked with a flag as an optimization).
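
To make the abstract idea a little more concrete, here is a hypothetical
standalone C sketch (none of these structures or functions exist in
PostgreSQL, and overflow handling, locking and cleanup are deliberately
ignored) of the "reverse" check: writers of the referencing table consult a
small shared set of referenced-key values currently being updated or
deleted, instead of locking the referenced row on every insert:

#include <stdbool.h>

#define MAX_PENDING_KEYS 64     /* arbitrary size for the sketch */

typedef struct
{
    int keys[MAX_PENDING_KEYS];
    int nkeys;
} PendingReferencedKeys;

/* called on the rare path: UPDATE/DELETE of a referenced key value */
static void
register_pending_key(PendingReferencedKeys *set, int key)
{
    if (set->nkeys < MAX_PENDING_KEYS)
        set->keys[set->nkeys++] = key;
    /* a real design would need overflow handling, locking and cleanup */
}

/* called on the common path: INSERT/UPDATE of a referencing row */
static bool
referencing_write_must_wait(const PendingReferencedKeys *set, int fk_value)
{
    for (int i = 0; i < set->nkeys; i++)
        if (set->keys[i] == fk_value)
            return true;        /* conflict: referenced key is in flux */
    return false;               /* fast path: no lock traffic at all */
}

int
main(void)
{
    PendingReferencedKeys pending = {{0}, 0};

    register_pending_key(&pending, 42);     /* dept 42 is being deleted */
    return referencing_write_must_wait(&pending, 42) ? 0 : 1;
}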

Maybe someone can come up with better ideas than this.

Gokul.


Re: [HACKERS] cheaper snapshots redux

2011-08-28 Thread Gokulakannan Somasundaram
 No, I don't think it will all be in memory - but that's part of the
 performance calculation.  If you need to check on the status of an XID
 and find that you need to read a page of data in from disk, that's
 going to be many orders of magnitude slower than anything we do with s
 snapshot now.  Now, if you gain enough elsewhere, it could still be a
 win, but I'm not going to just assume that.

 I was just suggesting this because memory costs have come down a lot (as
you may know) and people can afford to buy more memory in enterprise
scenarios. We may not need to worry about MBs of memory, especially with
cloud computing being widely adopted, if that is what it takes to get
scalability.

Gokul.


Re: [HACKERS] cheaper snapshots redux

2011-08-26 Thread Gokulakannan Somasundaram
On Tue, Aug 23, 2011 at 5:25 AM, Robert Haas robertmh...@gmail.com wrote:

 I've been giving this quite a bit more thought, and have decided to
 abandon the scheme described above, at least for now.  It has the
 advantage of avoiding virtually all locking, but it's extremely
 inefficient in its use of memory in the presence of long-running
 transactions.  For example, if there's an open transaction that's been
 sitting around for 10 million transactions or so and has an XID
 assigned, any new snapshot is going to need to probe into the big
 array for any XID in that range.  At 8 bytes per entry, that means
 we're randomly accessing about ~80MB of memory-mapped data.  That
 seems problematic both in terms of blowing out the cache and (on small
 machines) possibly even blowing out RAM.  Nor is that the worst case
 scenario: a transaction could sit open for 100 million transactions.

 First, I respectfully disagree with you on the point of 80MB. I would say
it is very rare that a small system (with 1 GB of RAM) has a long-running
transaction sitting idle while 10 million other transactions go by. Should
an optimization be dropped, for the sake of a very small system, when it
helps achieve high enterprise workloads?

Second, if we make use of memory-mapped files, why should we think that all
80MB of data will always reside in memory? Won't it get paged out by the
operating system when it is in need of memory? Or do you have some specific
OS in mind?

Thanks,
Gokul.


Re: [HACKERS] the big picture for index-only scans

2011-08-24 Thread Gokulakannan Somasundaram


 There are extensive comments on this in visibilitymap.c and, in
 heapam.c, heap_xlog_visible().

 I went through the design again and again. I am convinced that it should
not have any functional bugs and should not cause many performance issues.
Nice thinking on bypassing the WAL logging.

Gokul.


Re: [HACKERS] the big picture for index-only scans

2011-08-21 Thread Gokulakannan Somasundaram
 The all_visible_cleared flag is included in the WAL record of the insert
(or update or delete). Partial page writes are not a problem, because we
 always fetch the VM page and clear the bit, regardless of the LSN on the
VM page.


Two things:
a) First, my understanding of a checkpoint is that it flushes all the pages
that were changed below a particular LSN. If we do not have an LSN in the
visibility map, how will we be sure whether a visibility map page gets
flushed or not? With this approach, I can see only one issue: if the heap
page gets written and checkpointed but the visibility map doesn't get synced
during the checkpoint, then there is a problem. Can you please explain how
we get that assurance?

b) Even if we have a contention issue, the visibility map is a solution for
a considerable number of database scenarios. But it should not become part
of the default package. A table with no chance of index-only scans should
not incur the extra overhead of crash-safe visibility maps; such tables
should be spared that overhead, as they get nothing from index-only scans.

Gokul.


Re: [HACKERS] the big picture for index-only scans

2011-08-21 Thread Gokulakannan Somasundaram

 a) First, my understanding of checkpoint is that it flushes all the pages,
 that got changed below a particular LSN. If we are not having a LSN in the
 visibility map, how we will be sure, that a visibility map page is getting
 flushed/not?


Please ignore this statement. I confused the checkpoints implemented in
Postgres with those of some other database.

Gokul.


Re: [HACKERS] the big picture for index-only scans

2011-08-20 Thread Gokulakannan Somasundaram
 The above could already happen in 8.4, where the visibility map was
 introduced. The contention on the VM buffer would be just as bad whether you
 hold the heap page lock at the same time or not. I have not heard any
 complaints of contention on VM buffers.

 --
  Heikki Linnakangas


 a) First of all, the visibility map should definitely have affected the
scalability of Postgres in scenarios where updates occur during a batch time
window. Maybe the increase in the speed of vacuums negates that effect.
b) Second, index scans currently don't touch the visibility map, and in the
future they are going to acquire a share lock on it. This should increase
the contention.
c) Your statement: "The contention on the VM buffer would be just as bad
whether you hold the heap page lock at the same time or not."
I am talking about the contention time frame of the heap page. It will be
increased; consider my statement in conjunction with scenario 2.
d) In addition, there is currently no WAL logging while the bit is cleared,
which would not be the case in the future, and hence the exclusive lock on
the visibility map is going to be held for a longer time.

You might well see some performance improvement if you are testing this on
anything with fewer than 4 cores. Bump up the core count and processor count
and check whether this affects the load-throughput curve.

Thanks,
Gokul.


Re: [HACKERS] the big picture for index-only scans

2011-08-20 Thread Gokulakannan Somasundaram
 Hmm, you have a point.  If 100 backends simultaneously write to 100
 different pages, and all of those pages are all-visible, then it's
 possible that they could end up fighting over the buffer content lock
 on the visibility map page.  But why would you expect that to matter?
 In a heavily updated table, the proportion of visibility map bits that
 are set figures to be quite low, since they're only set during VACUUM.
  To have 100 backends simultaneously pick different pages to write
 each of which is all-visible seems really unlucky.   Even if it does
 happen from time to time, I suspect the effects would be largely
 masked by WALInsertLock contention.  The visibility map content lock
 is only taken very briefly, whereas the operations protected by
 WALInsertLock are much more complex.


By your argument, if WALInsertLock is held for 't' seconds, the visibility
map lock must be held for more than that time frame 't'. So the index scans
have to wait to acquire the lock on the visibility map to check visibility.
What we are trading off here is synchronization vs. I/O. Should we lose
scalability for performance?

Gokul.


Re: [HACKERS] the big picture for index-only scans

2011-08-20 Thread Gokulakannan Somasundaram
On Sat, Aug 20, 2011 at 4:48 PM, Gokulakannan Somasundaram 
gokul...@gmail.com wrote:


 The above could already happen in 8.4, where the visibility map was
 introduced. The contention on the VM buffer would be just as bad whether you
 hold the heap page lock at the same time or not. I have not heard any
 complaints of contention on VM buffers.

 --
  Heikki Linnakangas


 a) First of all, it(Visibility Map) should have definitely affected the
 scalability of postgres in scenarios where in updates occur during a time
 batch window. May be the increase in speed of vacuums negate that effect.
 b) Second, currently the index scans  don't touch the visibility map and in
 future they are going to acquire share lock on that. This should increase
 the contention.
 c) Your statement : The contention on the VM buffer would be just as bad
 whether you hold the heap page lock at the same time or not.
 I am talking about the contention time frame of the heap page. It will be
 increased Consider my statement in conjunction with the scenario 2.
 d) In addition, currently there is no WAL Logging, while the bit is
 cleared, which would not be the case in future and hence the exclusive lock
 held on the visibility map is going to be held for a longer time.

 You might definitely see some performance improvement, if you are testing
 this in anything less than 4 cores. Bump up the core count and processor
 count and check whether this affects the load-throughput curve.


By your argument, we could say that no one will create an index with a
function like random(), time(), date(), broken operators, etc. It's hard to
imagine an index into which a user will only want to insert and never select
from it. Even C++ provides the std::map infrastructure for objects, where
the user owns the responsibility of writing the comparison operator
correctly. Other databases do the same. Going one step further, we already
go back to such indexes when there are unique constraints or referential
integrity constraints. Yet with all such caveats, it was decided we should
not allow covering indexes, because they require going back to the index for
updates and deletes.


Re: [HACKERS] the big picture for index-only scans

2011-08-20 Thread Gokulakannan Somasundaram
   I think that you have switched gears here and are in this paragraph

 talking about the 8.4-era visibility map changes rather than my recent
 work.  There is zero evidence that those changes caused any
 performance problem.  I've spent a large chunk of the last four months
 looking for scalability problems and I haven't come across any
 evidence that this is an issue.  If you think it is, let's have a test
 case.


Consider the TPC-C benchmark. Currently, on a four-core box, Oracle does
29 transactions per minute. Say we do something around half of that, so a
page should get filled quickly. If a vacuum runs and the tuples are not yet
touched by the MakePayment transaction, the page will be marked all-visible.
When the MakePayment transaction updates it, it has to clear that bit. If we
test with too few warehouses, we may not see the effect. Of course, the
PGPROC infrastructure for generating transaction ids might be the biggest
culprit, but if you ignore that, this effect should be visible.

Maybe.  Of course, we're only talking about cases where the index-only
 scan optimization is in use (which is not all of them).

But are you saying that the optimization of looking at the visibility map
will happen only for index-only scans, and that normal index scans will not
use the visibility map infrastructure? That's definitely a good idea and an
improvement.

 d) In addition, there is currently no WAL logging while the bit is cleared,
which would not be the case in the future, and hence the exclusive lock on
the visibility map is going to be held for a longer time.

 This is false and has been false since the visibility map was first
implemented.

I can't understand this. If you are not doing this, then it would cause
consistency issues. Are you saying we have a crash-safe visibility map, but
you don't follow the WAL principle of logging the change before changing the
contents? If so, please explain in detail. If you are doing it in the normal
way, then you should be logging the changes before making the changes to the
buffer, and during that time frame you should be holding the lock on the
buffer. Heikki specifically pointed out that you have brought the WAL
logging of the visibility map into the critical section.

Heikki's comments, which I am quoting:
 I believe Robert's changes to make the visibility map crash-safe covers
that. Clearing the bit in the visibility map now happens within the same
critical section as clearing the flag on the heap page and writing the WAL
record.

I will be able to respond to your other statements once we are clear here.

Thanks,
Gokul.


Re: [HACKERS] the big picture for index-only scans

2011-08-20 Thread Gokulakannan Somasundaram
On Sat, Aug 20, 2011 at 4:57 AM, Gokulakannan Somasundaram

 gokul...@gmail.com wrote:
  by your argument, if WALInsertLock is held for 't' seconds, you should
  definitely be holding the visibility map lock for more than time frame 't'.

 Nope, that's not how it works.  Perhaps you should read the code.
 See, e.g., heap_update().

 --

OK. I took a look at the patch you supplied in
http://postgresql.1045698.n5.nabble.com/crash-safe-visibility-map-take-five-td4377235.html
There is code like this:

 {
     all_visible_cleared = true;
     PageClearAllVisible(BufferGetPage(buffer));
+    visibilitymap_clear(relation,
+                        ItemPointerGetBlockNumber(&(heaptup->t_self)),
+                        vmbuffer);
 }

Here you are not making an entry into the WAL, so there is an assumption
that the two bits will stay in sync without any WAL entry. There is a chance
that the visibility map might be affected by partial page writes, where the
clearing of a particular bit might not have made it to disk. I am thinking
of a lot of such issues. Can you explain the background logic behind
ignoring the principle of WAL logging? What are the implemented principles
that protect the visibility map pages?

Thanks,
Gokul.


Re: [HACKERS] the big picture for index-only scans

2011-08-20 Thread Gokulakannan Somasundaram
  By your argument, we can say that no-one will create a index with a
 function
  like random(), time(), date(), broken operators etc. Its hard to imagine
 a
  index in which a a user will only want to insert and never select on it.

 The whole point of this optimization is to make index access cheaper,
 not more expensive.  You seem convinced it's done the opposite, but
 you haven't offered much evidence (or any test results) to support
 that proposition.


I hope you are referring to thick indexes / covering indexes / indexes with
snapshot. Why do you think it has done the opposite? In fact, other
databases like Oracle, SQL Server, and Sybase implement indexes with
snapshot (that's the only kind they support). It makes the index tuple
larger by 8 bytes but avoids the heap fetch. I think I ran a couple of
benchmarks when I submitted the patch and showed the improvement. The
trade-off in that case was simple: size of the index vs. avoiding a disk
I/O. The user still had the choice of creating indexes without the snapshot
(it was provided as an optional kind of index).



 What we decided NOT to do is put xmin/xmax/cid into the index tuple,
 for precisely the reasons you mention.  That would be catastrophic
 both for write operations to the table, and for the size of the index.


Why would it be catastrophic for write operations on the table? Please
explain. The trade-off in that case was simple: size of the index vs.
avoiding a disk I/O. There was no catastrophic damage to the size of the
index, as far as I can see.

I made this point because Heikki argued that since no one is complaining
about a performance problem, we can assume it doesn't exist. But the thick
index proposal was shot down on the grounds that someone might create an
index on a function like random(), time(), or date(), or with broken
operators, effectively meaning that you can insert into the index but cannot
select back from it. We already do unique checks and referential integrity
checks on that kind of index (which would all be wrong), and yet the
argument was that we should not work in that area, to keep the user from
making the mistake of creating such indexes. So we should apply the same
principle for decision making here too.

Gokul.


Re: [HACKERS] the big picture for index-only scans

2011-08-19 Thread Gokulakannan Somasundaram


 Note that we already have the visibility map, and the accesses needed to
 update it are already there. Granted, we'll have to change the logic
 slightly to make it crash safe, but I don't expect that to add any
 meaningful overhead - the changes are going to be where the bits are set,
 ie. vacuum, not when the bits are cleared. Granted, we might also want to
 set the bits more aggressively once they're used by index-only-scans. *But
 done correctly, just taking advantage of the VM that's already there
 shouldn't add overhead to other operations.*

 I agree that we need to do tests to demonstrate that there's a gain from
 the patch, once we have a patch to test. I would be very surprised if there
 isn't, but that just means the testing is going to be easy.

 --
  Heikki Linnakangas

  EnterpriseDB   http://www.enterprisedb.com


  I can see some arguments supporting this feature, citing covering indexes
as an example, but I just want to highlight that they are not the same. The
visibility map approach is not related to the covering index approach,
except for the shared intention of avoiding the heap fetch. Because of its
small size, we will have more contention on it (people who have worked with
Oracle can take the example of a bitmap index on an OLTP database). I
previously suggested making these crash-safe visibility maps optional per
table, so that the overhead that comes with them can be avoided for tables
whose queries don't benefit from index-only scans. The fact that the
proposal makes the crash-safe visibility map a default part of every
PostgreSQL table will definitely have wide-ranging implications for OLTP
performance.

Gokul.


Re: [HACKERS] the big picture for index-only scans

2011-08-19 Thread Gokulakannan Somasundaram


 Well, that would certainly be alarming if true, but I don't think it
 is.  As far as I can see, the overhead of making the visibility map
 crash-safe is just (1) a very small percentage increase in the work
 being done by VACUUM and (2) a slight possibility of extra work done
 by a foreground process if the visibility map bit changes at almost
 exactly the same time the process was about to insert, update, or
 delete a tuple.

 Let's set aside the overhead posed by vacuum. Can you please point me to
the design which discusses the second overhead in detail?

Thanks.


Re: [HACKERS] the big picture for index-only scans

2011-08-19 Thread Gokulakannan Somasundaram

 Well, that would certainly be alarming if true, but I don't think it
 is.  As far as I can see, the overhead of making the visibility map
 crash-safe is just (1) a very small percentage increase in the work
 being done by VACUUM and (2) a slight possibility of extra work done
 by a foreground process if the visibility map bit changes at almost
 exactly the same time the process was about to insert, update, or
 delete a tuple.

 Let's forget the overhead posed by vacuum. Can you please point me to the
 design which talks in detail of the second overhead?

 Thanks.


If you are following the same design that Heikki put forward, then there is
a problem with keeping the bits on the heap page and the bits in the
visibility map in sync, which we have already discussed.


Re: [HACKERS] the big picture for index-only scans

2011-08-19 Thread Gokulakannan Somasundaram
On Sat, Aug 20, 2011 at 2:25 AM, Heikki Linnakangas 
heikki.linnakan...@enterprisedb.com wrote:

 On 19.08.2011 21:06, Gokulakannan Somasundaram wrote:

 If you are following the same design that Heikki put forward, then there
 is
 a problem with it in maintaining the bits in page and the bits in
 visibility
 map in sync, which we have already discussed.


 Are you referring to this:
 http://archives.postgresql.org/pgsql-hackers/2010-02/msg02097.php ?
  I believe Robert's changes to make the visibility map crash-safe covers
 that. Clearing the bit in the visibility map now happens within the same
 critical section as clearing the flag on the heap page and writing the WAL
 record.

 In that case, say 100 sessions are trying to update records that fall under
the 8000*4 heap pages (I assume 2 bits per heap page in the visibility map,
i.e. 8 * 1024 * 4 exactly) covered by one page of the visibility map. Won't
that make 99 of the sessions wait for that visibility map page while holding
exclusive locks on their 99 heap pages?
Are we going to say that these kinds of situations occur very rarely? Or
that the loss of scalability in these situations is worth the performance
gain in read-heavy workloads?

In any case, a database would go through all this extra overhead even when
it doesn't use any index-only scans at all. That definitely should be given
a thought.

Gokul.


Re: [HACKERS] the big picture for index-only scans

2011-08-19 Thread Gokulakannan Somasundaram
On Sat, Aug 20, 2011 at 2:51 AM, Gokulakannan Somasundaram 
gokul...@gmail.com wrote:

 On Sat, Aug 20, 2011 at 2:25 AM, Heikki Linnakangas 
 heikki.linnakan...@enterprisedb.com wrote:

 On 19.08.2011 21:06, Gokulakannan Somasundaram wrote:

 If you are following the same design that Heikki put forward, then there
 is
 a problem with it in maintaining the bits in page and the bits in
 visibility
 map in sync, which we have already discussed.


  Are you referring to this:
  http://archives.postgresql.org/pgsql-hackers/2010-02/msg02097.php ?
  I believe Robert's changes to make the visibility map crash-safe covers
 that. Clearing the bit in the visibility map now happens within the same
  critical section as clearing the flag on the heap page and writing the WAL
 record.

 In that case, say a 100 sessions are trying to update records which fall
 under the 8000*4 heap pages( i assume 2 bits per visibility map - 8 * 1024 *
 4 exact) covered by one page of visibility map, won't it make the 99
 sessions wait for that visibility map while holding the exclusive lock on
 the 99 heap pages?
 Are we going to say, that these kind of situations occur very rarely? Or
 that the loss of scalability in these situations, is worth the performance
 during the read-heavy workloads?

 In any case, making a database going through all these extra overheads,
 while they don't even have any index-only scans!!!  That definitely should
 be given a thought.

 Gokul.


Please consider that the update I mentioned above occurs and changes the
visibility bit of the page.


Re: [HACKERS] Reduce WAL logging of INSERT SELECT

2011-08-05 Thread Gokulakannan Somasundaram

 However, for small operations it's a net loss - you avoid writing a WAL
 record, but you have to fsync() the heap instead. If you only modify a few
 rows, the extra fsync (or fsyncs if there are indexes too) is more expensive
 than writing the WAL.

 We'd need a heuristic to decide whether to write WAL or fsync at the end.
 For regular INSERTs, UPDATEs and DELETEs, you have the planner's estimate of
 number of rows affected. Another thing we should do is move the fsync call
 from the end of COPY (and other such operations) to the end of transaction.
 That way if you do e.g one COPY followed by a bunch of smaller INSERTs or
 UPDATEs, you only need to fsync the files once.


Have you thought about recovery, especially when the page size is greater
than the disk block size (4K)? With WAL, there is a way to restore the pages
to a consistent state during recovery if there are partial page writes. Is
it possible to do the same with a direct fsync and no WAL?

Gokul.


Re: [HACKERS] crash-safe visibility map, take four

2011-03-23 Thread Gokulakannan Somasundaram


 All operations that clear the bit are already WAL-logged.

 Is it the case with visibility map also?

Thanks.


Re: [HACKERS] crash-safe visibility map, take four

2011-03-23 Thread Gokulakannan Somasundaram
Yeah, I looked at it. I don't think it addresses the problem raised here:

http://archives.postgresql.org/pgsql-hackers/2010-02/msg02097.php

Or maybe I am missing something.

Thanks.

On Wed, Mar 23, 2011 at 7:54 PM, Robert Haas robertmh...@gmail.com wrote:

 On Wed, Mar 23, 2011 at 2:29 AM, Gokulakannan Somasundaram
 gokul...@gmail.com wrote:
  All operations that clear the bit area are already WAL-logged.
 
  Is it the case with visibility map also?
  Thanks.

 Yes.  Look at the comment that the patch removes.  That describes the
 problem being fixed.

 --
 Robert Haas
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company



Re: [HACKERS] crash-safe visibility map, take four

2011-03-22 Thread Gokulakannan Somasundaram

 I took a crack at implementing the first approach described above,
 which seems to be by far the simplest idea we've come up with to date.
  Patch attached.  It doesn't seem to be that complicated, which could
 mean either that it's not that complicated or that I'm missing
 something.  Feel free to point and snicker in the latter case.

 Hi,
I suppose the problem is not with setting the bit, but with resetting it.
Has that been handled already?

Thanks.


Re: [HACKERS] GSoC Query

2010-03-29 Thread Gokulakannan Somasundaram



  Similarly using the no. of select hits on a table we can check that if
 maximum no. of times it is on a non-index field we can index on that field
 to make select faster.


 It's impractical to figure out where indexes should go at without
 simulating what the optimizer would then do with them against a sample set
 of queries.  You can't do anything useful just with basic statistics about
 the tables.

 I would recommend
 http://msdn.microsoft.com/en-us/library/aa226167(SQL.70).aspx as
  a good, practical introduction to the topic of what it takes to figure
 out where indexes go at, from someone who came up with a reasonable solution
 to that problem.  You can find a list of the underlying research they cite
 (and an idea what has been done since then) at
 http://portal.acm.org/citation.cfm?id=673646


Even if you have devised a way to find the appropriate set of indexes, just
build an index adviser that suggests a set of indexes for a set of queries,
and let the DBA and the application user take the final call after looking
at them.

Gokul.


[HACKERS] Proposal for Performance improvement for unique checks

2010-03-27 Thread Gokulakannan Somasundaram
I don't think this would involve much code change, but no one seems
interested.

On Sat, Mar 27, 2010 at 2:23 AM, Gokulakannan Somasundaram 
gokul...@gmail.com wrote:

 Hi,
Since we insert a new entry into the index for every update that's being
 made into the table, we inevitably make a unique check against the older
 version of the newly inserted row, even when the values are not updated. Of
 course i am talking about non-HOT updates. (We will not go to the index for
 HOT updates)

 a) The page which contains the index entry is Exclusively locked
 b) We go ahead and visit the heap page for its HeapTupleSatisfiesDirty.

 If we have the information of the old tuple(its tuple-id) after a heap
 update, during the index insert, we can avoid the uniqueness check for this
 tuple,as we know for sure that tuple won't satisfy the visibility criteria.
 If the table has 'n' unique indexes it avoids 'n' heap tuple lookups, also
 increasing the concurrency in the btree, as the write lock duration is
 reduced.

 Any comments?

 Thanks,
 Gokul.



Re: [HACKERS] Performance improvement for unique checks

2010-03-27 Thread Gokulakannan Somasundaram

 Please write it, then test the performance and publish your results,
 with a detailed analysis of whether there is benefit and in which cases
 there is a loss.


 Thanks. Will do. I just wanted to know whether the idea would get rejected
right away or is worth trying out.

Thanks,
Gokul.


[HACKERS] A Bug in outDatum ?? (Not Sure )

2010-03-27 Thread Gokulakannan Somasundaram
Hi,
   The function outDatum doesn't take care of varlena structures that are
stored externally. Is that intentional?

Thanks,
Gokul.


[HACKERS] Performance improvement for unique checks

2010-03-26 Thread Gokulakannan Somasundaram
Hi,
   Since we insert a new entry into the index for every update made to the
table, we inevitably make a unique check against the older version of the
newly inserted row, even when the indexed values are not updated. Of course,
I am talking about non-HOT updates (we do not go to the index for HOT
updates).

a) The page which contains the index entry is exclusively locked.
b) We go ahead and visit the heap page for its HeapTupleSatisfiesDirty check.

If we carry the information about the old tuple (its tuple-id) from the heap
update into the index insert, we can skip the uniqueness check against that
tuple, as we know for sure it won't satisfy the visibility criteria. If the
table has 'n' unique indexes, this avoids 'n' heap tuple lookups and also
increases concurrency in the btree, as the write-lock duration is reduced.
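
A hypothetical standalone sketch of the idea (this is not PostgreSQL's
_bt_check_unique logic; the types and function names are invented for
illustration): if the index insert knows the TID of the old row version, a
conflicting index entry pointing at that TID can be skipped without visiting
the heap:

#include <stdio.h>

/* invented stand-in for an ItemPointer (block number + offset) */
typedef struct
{
    unsigned       block;
    unsigned short offset;
} TidSketch;

static int
tid_equal(TidSketch a, TidSketch b)
{
    return a.block == b.block && a.offset == b.offset;
}

/* returns 1 if the heap still has to be visited to decide visibility */
static int
needs_heap_check(TidSketch index_entry_tid,
                 const TidSketch *old_version_tid)  /* NULL for plain INSERT */
{
    if (old_version_tid && tid_equal(index_entry_tid, *old_version_tid))
        return 0;   /* it is our own just-superseded row version */
    return 1;
}

int
main(void)
{
    TidSketch old_tid = {10, 3};  /* old version of the row we just updated */
    TidSketch found   = {10, 3};  /* matching index entry met during the check */

    printf("heap check needed: %d\n", needs_heap_check(found, &old_tid)); /* 0 */
    return 0;
}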

Any comments?

Thanks,
Gokul.


[HACKERS] Performance Improvement for Unique Indexes

2010-03-24 Thread Gokulakannan Somasundaram
Hi,
   While I was studying the unique index checks very closely, I realized
that what we need to find out is whether the tuple has been deleted or not.
Say a tuple is deleted by a transaction but is not yet dead (because of some
long-running transaction); we can still mark a hint bit saying it is
deleted, and that will help subsequent transactions doing the unique checks.
As a matter of fact, it will help the deferred-unique cases, since those
check the tuples twice anyway when there is a duplicate.
   We have one bit left in the index tuple that can be used as a hint bit.
If we are ready to break disk compatibility, then we can store the size as a
multiple of 8 and get three bits free. Any comments?
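
Purely to illustrate the bit arithmetic behind the "three bits free" claim
(this is not the real IndexTupleData layout, just an assumed packing for the
example): if sizes are always multiples of 8, storing size/8 in the 13-bit
size field leaves three bits over:

#include <assert.h>

typedef struct
{
    unsigned int size_div8 : 10;  /* tuple size / 8; covers sizes up to 8184 */
    unsigned int hints     : 3;   /* the three bits freed by the denser encoding */
    unsigned int flags     : 3;   /* existing flag bits stay where they are */
} IndexInfoWordSketch;

int
main(void)
{
    IndexInfoWordSketch w = {0};

    w.size_div8 = 48 / 8;             /* a 48-byte index tuple */
    assert(w.size_div8 * 8 == 48);    /* the exact size is still recoverable */
    return 0;
}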

Thanks,
Gokul.


Re: [HACKERS] Performance Improvement for Unique Indexes

2010-03-24 Thread Gokulakannan Somasundaram


 How are you going to unmark the hint bit in case of a rollback?


This hint bit would be set only after you find that the deleting transaction
has committed. It is equivalent to any other hint bit.

Gokul.


Re: [HACKERS] Performance Improvement for Unique Indexes

2010-03-24 Thread Gokulakannan Somasundaram
 it seems fairly unlikely to me that this would be useful enough to
 justify using up a precious hint bit.  The applicability of the hint
 is very short-term --- as soon as the tuple is dead to all transactions,
 it can be marked with the existing LP_DEAD hint bit.  And if it's only
 useful for uniqueness checks, as seems to be the case, that's another
 big restriction on the value.

 Right. It is of little value.

Gokul.


[HACKERS] Deadlock possibility in _bt_check_unique?

2010-03-23 Thread Gokulakannan Somasundaram
Hi,
   With the implementation of deferred unique constraints, we need to go
back to the index a second time to check whether the unique constraint still
holds. Say a situation like this occurs:
a) the first session doing the unique check finds out that a second check
will be required, makes its entry, and comes back;
b) the second session doing the unique check likewise finds out that a
second check will be required, makes its entry, and comes back.

When they do the second check, the first session will wait for the second
session to complete and vice versa. Won't this result in a deadlock? Isn't
this a realistic scenario?

Thanks,
Gokul.


Re: [HACKERS] Deadlock possibility in _bt_check_unique?

2010-03-23 Thread Gokulakannan Somasundaram
Can you also explain how we avoid duplicates in this scenario?

a) Say there are three pages (P, Q, R) full of duplicate tuples for id x
that are deleted but not yet dead (due to some long-running transaction).
b) Session A comes in and checks those duplicate tuples for liveness against
the heap, holding a shared lock on pages P, Q, and R in turn. Since all of
them are deleted, it gets the message that it need not come back to check
again for uniqueness. It then starts again from P looking for free space to
insert its tuple; say it inserts the tuple at page Q.
c) Session B (with the same id x) starts after Session A, but it passes Q
before Session A's tuple has been inserted. It will also get the response
from _bt_check_unique that it need not come back for a second unique check.
It then looks for free space starting from P, finds free space at P, and
inserts the new record at P itself.

So we end up with two duplicate records even though there is a unique
constraint. Is this a possible scenario?

Thanks,
Gokul.


Re: [HACKERS] Deadlock possibility in _bt_check_unique?

2010-03-23 Thread Gokulakannan Somasundaram

 Are you talking about exclusion constraints or btree uniqueness
 constraints?  This doesn't seem to be a particularly accurate
 description of the implementation of either one.  The way btree
 deals with this is explained in _bt_doinsert:


Unique constraints



 * NOTE: obviously, _bt_check_unique can only detect keys that are already
 * in the index; so it cannot defend against concurrent insertions of the
 * same key.  We protect against that by means of holding a write lock on
 * the target page.  Any other would-be inserter of the same key must
 * acquire a write lock on the same target page, so only one would-be
 * inserter can be making the check at one time.  Furthermore, once we are
 * past the check we hold write locks continuously until we have performed
 * our insertion, so no later inserter can fail to see our insertion.
 * (This requires some care in _bt_insertonpg.)



This is fine if the second session has to pass through the page where the
first session inserted the record. But, as I said, if the second session
finds a free slot before hitting the page where the first session inserted,
then it will never hit that page with a write lock. The comment says:

 * Furthermore, once we are past the check we hold write locks continuously
 * until we have performed our insertion, so no later inserter can fail to
 * see our insertion.  (This requires some care in _bt_insertonpg.)

But in the case I suggested (page1, page2, page3 full of deleted but
not-yet-dead duplicate tuples), the first session checks page1, finds no
free space, moves to page2, finds free space, and inserts. But when the
second session checks page1, say some of the tuples there have become dead;
it then finds free space there and inserts, but never sees the insertion of
the first session. Maybe I am missing something; I just thought I would
raise the flag.

Gokul.


Re: [HACKERS] Deadlock possibility in _bt_check_unique?

2010-03-23 Thread Gokulakannan Somasundaram
 No, you don't understand how it works.  All would-be inserters will hit
 the same target page to begin with, ie, the first one that the new key
 could legally be inserted on.  The lock that protects against this
 problem is the lock on that page, regardless of which page the key
 actually ends up on.


Consider time instants T1, T2, T3, T4:

T1: session 1 holds the write lock on page p1 and completes the unique check
on p1, p2 and p3.

T2: session 1 releases the lock on p1 (it is waiting to acquire an exclusive
lock on p2).

T3: session 2 acquires the write lock on p1 and completes the unique check
on p1, p2 and p3. Here, I believe session 2 has a chance of getting its read
lock before session 1 gets the write lock on p2. Is that not possible?

T4: session 1 gets the write lock on p2 and inserts, and session 2 gets the
write lock on p1 and inserts.

OK, I have stated my assumptions. Is my assumption at time instant T3 wrong,
i.e. can it never happen?

Thanks,
Gokul.


Re: [HACKERS] Deadlock possibility in _bt_check_unique?

2010-03-23 Thread Gokulakannan Somasundaram
  T2 : session 1 releases the lock on p1 (its waiting to acquire a ex lock
 on
  p2)

 That's not what we do.  See _bt_findinsertloc.

regards, tom lane


  I am really confused. Please keep cool and explain to me if I am wrong. I
can see this code in _bt_findinsertloc: there is a _bt_relandgetbuf, which
releases the lock on p1 and tries to acquire a lock on p2. (I wrote 'ex
lock' above in place of BT_WRITE.)

for (;;)
{
    BlockNumber rblkno = lpageop->btpo_next;

    rbuf = _bt_relandgetbuf(rel, rbuf, rblkno, BT_WRITE);
    page = BufferGetPage(rbuf);
    lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
    if (!P_IGNORE(lpageop))
        break;
    if (P_RIGHTMOST(lpageop))
        elog(ERROR, "fell off the end of index \"%s\"",
             RelationGetRelationName(rel));
}


What is it that I am missing here?

Gokul.


Re: [HACKERS] Deadlock possibility in _bt_check_unique?

2010-03-23 Thread Gokulakannan Somasundaram
Oh yeah, I got it. Thanks.

On Wed, Mar 24, 2010 at 12:26 AM, Gokulakannan Somasundaram 
gokul...@gmail.com wrote:


  T2 : session 1 releases the lock on p1 (its waiting to acquire a ex lock
 on
  p2)

 That's not what we do.  See _bt_findinsertloc.

regards, tom lane


   I am really confused. Please keep the cool and explain me, if i am
 wrong. I could see this code in _bt_findinsertloc. There is a
 _bt_relandgetbuf, which releases lock on p1 and tries to acquire a lock on
 p2. I wrote ex lock in the place of BT_WRITE.

 for (;;)
 {
     BlockNumber rblkno = lpageop->btpo_next;

     rbuf = _bt_relandgetbuf(rel, rbuf, rblkno, BT_WRITE);
     page = BufferGetPage(rbuf);
     lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
     if (!P_IGNORE(lpageop))
         break;
     if (P_RIGHTMOST(lpageop))
         elog(ERROR, "fell off the end of index \"%s\"",
              RelationGetRelationName(rel));
 }


 What is that, i am missing here?

 Gokul.



[HACKERS] Proposal for Byte savings in VarBit structure

2010-03-21 Thread Gokulakannan Somasundaram
Hi,
   I was looking at the VarBit data structure and realized that, instead of
storing the number of bits in four bytes, we could store just the number of
bits that are valid in the last byte. Since we already store the number of
bytes in the varlena header, we can calculate the total number of valid bits
with a little math: (number of bytes - 1) * 8 + valid bits in the last byte.
  This would save at least 8 bytes for someone who is using the varbit data
type with fewer than 24 bits.
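
A small standalone illustration of the proposed math (not the actual VarBit
on-disk layout; the function name is invented for the example):

#include <stdio.h>

static unsigned int
varbit_bitlen(unsigned int data_bytes, unsigned int valid_bits_in_last_byte)
{
    if (data_bytes == 0)
        return 0;
    return (data_bytes - 1) * 8 + valid_bits_in_last_byte;
}

int
main(void)
{
    /* a 19-bit value occupies 3 data bytes, with 3 valid bits in the last one */
    printf("%u\n", varbit_bitlen(3, 3));   /* prints 19 */
    return 0;
}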

  Waiting for comments.

Thanks,
Gokul.


Re: [HACKERS] Proposal for Byte savings in VarBit structure

2010-03-21 Thread Gokulakannan Somasundaram
 This might be worth considering in a release cycle where we were going
 to break on-disk data compatibility for some other reason.  But I can
 hardly imagine wanting to do it by itself.  Marginal space savings for
 the bit types just isn't that exciting.

 Maybe we should start a special section of TODO for things we might do
 next time we break data compatibility.


Thanks, Tom. So when we say breaking data compatibility, we mean the next
release where we will recommend a pg_dump/restore, right?

Thanks,
Gokul.


Re: [HACKERS] An idle thought

2010-03-19 Thread Gokulakannan Somasundaram
 Surely the VM is already update-friendly. If you update a tuple in a
 page with the visibility bit set, the bit must be unset or you will get
 wrong results.



I was referring to the context of index-only scans skipping visibility
checks. I doubt whether a visibility map feature that skips visibility
checks on the heap can be created without any extra cost to updates and
inserts. When data is compressed, there is more contention on the same
block, which would likely affect DML. I believe that is what Tom was also
referring to, though not in the visibility map context.

Gokul.


Re: [HACKERS] An idle thought

2010-03-18 Thread Gokulakannan Somasundaram
On Thu, Mar 18, 2010 at 2:50 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 Jeff Davis pg...@j-davis.com writes:
  There are all kinds of challenges there, but it might be worth thinking
  about. Visibility information is highly compressible, and requires
  constant maintenance (updates, deletes, freezing, etc.). It also might
  make it possible to move to 64-bit xids, if we wanted to.

 If you want it to be cheaply updatable (or even cheaply readable),
 compression is not what you're going to do.

regards, tom lane

 +1..


Re: [HACKERS] An idle thought

2010-03-18 Thread Gokulakannan Somasundaram

  Secondly there's the whole retail vacuum problem -- any
 index entries referring to this page would be left dangling unless
 there's some kind of retail vacuum or perhaps a page version number.


We can divide the issue into two parts:
a) volatile functions
b) broken datatypes

For a), I think the volatile function issue can be resolved by using hidden
columns in the heap itself. This creates a duplication of data, but since
the index is created on those columns, it will always point to the right
heap tuple.

For b), we already suffer from this issue in index lookups, index-based
updates/deletes, unique constraints, referential integrity maintenance,
etc. So hopefully one day we can extend this list to include more. :)

Gokul.


Re: [HACKERS] An idle thought

2010-03-18 Thread Gokulakannan Somasundaram


 I didn't mean that we'd want to compress it to the absolute minimum
 size. I had envisioned that it would be a simple scheme designed only to
 eliminate long runs of identical visibility information (perhaps only
 the frozen and always visible regions would be compressed).

 The extra level of indirection would be slower, but if we freeze tuples
 more aggressively (which would be much cheaper if we didn't have to go
 to the heap), there might be a small number of tuples with interesting
 visibility information at any particular time.


 This should be achievable with Heikki's current proposal, but I think it is
useful only for tables that won't have many concurrent operations, and on
databases without any long-running queries. So if we had an option to create
a visibility map (along the lines of a materialized view) whenever we want
for a table, it would help a good number of use cases, I suppose.

Thanks,
Gokul.


Re: [HACKERS] An idle thought

2010-03-18 Thread Gokulakannan Somasundaram
 The visibility map itself is already an example of compression. If
 visibility information were randomly distributed among tuples, the
 visibility map would be nearly useless.


 I believe it is very difficult to make the visibility map update-friendly
without compromising durability. But such functionality is still very much
wanted in PG.


Re: [HACKERS] Bug in 9.0Alpha4

2010-03-17 Thread Gokulakannan Somasundaram


 When we were doing the ordered-aggregates patch, I considered passing
 all those values as explicit parameters to transformAggregateCall,
 and having it build the Aggref node from scratch and return it.
 However having seven or eight parameters to transformAggregateCall
 (and more in future if we ever add more features here) didn't really
 seem to be better style than abusing Aggref a bit.  But maybe it is
 the best way after all.  Thoughts?


I feel it would be good to pass the parameters explicitly and, if their
number grows, to put them inside another structure (a data-carrier
structure) and pass that. But please take my suggestion as a novice one. :)

Thanks,
Gokul.


[HACKERS] Bug in 9.0Alpha4

2010-03-16 Thread Gokulakannan Somasundaram
Hi,
   I noticed a problem with the source code of 9.0 Alpha 4. In parse_agg.c,
there is a call made to transformSortClause:

torder = transformSortClause(pstate,
                             agg->aggorder,
                             &tlist,
                             true /* fix unknowns */ ,
                             true /* force SQL99 rules */ );


   Here agg->aggorder should be a List of SortGroupClause pointers, whereas
transformSortClause expects its second argument to be a list of SortBy
pointers. I verified the doxygen code by downloading the 9.0alpha4 version.
I am still trying to understand this piece of code, but I thought I should
report this bug.

Thanks,
Gokul.


Re: [HACKERS] Bug in 9.0Alpha4

2010-03-16 Thread Gokulakannan Somasundaram
Hi,
I think this would be the probable fix:

there is an agg_order list in ParseFuncOrColumn, which should get passed on
to transformAggregateCall, and that is what should be used in this call
instead of agg->aggorder.

Thanks,
Gokul.

On Tue, Mar 16, 2010 at 5:19 PM, Gokulakannan Somasundaram 
gokul...@gmail.com wrote:

 Hi,
I noticed a problem with the source code of 9.0Alpha 4. In parse_agg.c,
 there is a call made to transformSortClause.

 torder = transformSortClause(pstate,
                              agg->aggorder,
                              &tlist,
                              true /* fix unknowns */ ,
                              true /* force SQL99 rules */ );


Here agg-aggorder should be a List of SortGroupClause pointers, whereas
 transformSortClause expects the second argument as a list of SortBy
 pointers. I verified the doxygen code by downloading the 9.0alpha4 version.
 I am trying to understand this piece of code, while i thought i should
 report this bug.

 Thanks,
 Gokul.



Re: [HACKERS] Bug in 9.0Alpha4

2010-03-16 Thread Gokulakannan Somasundaram

 transformSortClause is passed the untransformed aggorder list, which is
 in fact a list of SortBy nodes, and it returns the transformed list
 (SortGroupClause nodes), which is stored back into the aggorder field
 a bit further down.

 There are a number of regression tests that would fail in obvious ways
 if this code didn't work.

 Right, Tom. I got confused because the comment at the Aggref struct
definition says it is a list of SortGroupClause. Maybe you can update the
comment there.

Thanks,
Gokul.


Re: [HACKERS] A thought on Index Organized Tables

2010-03-02 Thread Gokulakannan Somasundaram


 I think you have to take up a simpler project as a first project. This
 is a major overhaul of transaction information and it depends on
 understanding how a lot of different areas work -- all of which are
 very complex tricky areas to understand.



Greg,
  I just feel that a fast full index scan may not be of much value if we
have to go to the table for visibility information. I think the feature
needs the visibility map to be completed. Please let me know if you feel
otherwise.

Thanks,
Gokul.


Re: [HACKERS] A thought on Index Organized Tables

2010-03-01 Thread Gokulakannan Somasundaram

  a) We are already going from table to index to do unique checks. This is
 the
  same thing, which we will do to go and update the snapshot in the
 indexes.

 No, it is not the same thing.  Updating index snapshots requires being
 able to *re-find* a previously made index entry for the current row.
 And it has to be done 100% reliably.  The worst that happens if an index
 entry is not found when it should be during a uniqueness check is that
 the uniqueness constraint is not enforced properly; which is bad but it
 doesn't lead to internally-inconsistent data structures.


Tom,
We also go to the indexes to maintain referential integrity constraints like
foreign keys. Constraints like ON DELETE CASCADE and ON DELETE RESTRICT are
maintained through the indexes, and if we say that indexes can return wrong
results, then referential integrity is lost and we are no longer ACID
compliant.

Thanks,
Gokul.


Re: [HACKERS] A thought on Index Organized Tables

2010-02-28 Thread Gokulakannan Somasundaram
 No, it is not the same thing.  Updating index snapshots requires being
 able to *re-find* a previously made index entry for the current row.
 And it has to be done 100% reliably.  The worst that happens if an index
 entry is not found when it should be during a uniqueness check is that
 the uniqueness constraint is not enforced properly; which is bad but it
 doesn't lead to internally-inconsistent data structures.


Hmmm... OK Fine... I am leaving this proposal once and for all.



 Pretending the problem doesn't exist doesn't make it go away ...

 Because this is how it is done in other databases.
Ref: http://www.akadia.com/services/ora_function_based_index_2.html

Thanks,
Gokul.


Re: [HACKERS] A thought on Index Organized Tables

2010-02-28 Thread Gokulakannan Somasundaram

 The transaction information on tuples take 18 bytes plus several info
 bits. It's possible just storing a subset of that would be useful but
 it's unclear. And I think it would complicate the code if it had to
 sometimes fetch the heap tuple to get the rest and sometimes doesn't.


The visibility map had a similar proposal and it got accepted. Fine... I
think if you all are going to push back so hard, there might be some issues
that I am not foreseeing right now.



 I think you have to take up a simpler project as a first project. This
 is a major overhaul of transaction information and it depends on
 understanding how a lot of different areas work -- all of which are
 very complex tricky areas to understand.

 Yep, I would start by just joining someone else's project to help them out.

Thanks,
Gokul.


Re: [HACKERS] A thought on Index Organized Tables

2010-02-27 Thread Gokulakannan Somasundaram
If i have got over excited in the previous update, please ignore that.

a) We are already going from the table to the index to do unique checks. This is
the same thing we would do to go and update the snapshot in the indexes.
b) The way it should work is to check whether the operator is broken /
the function is volatile, and put the onus on the user to make sure
that they are updated correctly.
c) In the ItemId, instead of removing the size field completely, we can
store the size as size/4 (since it is MAXALIGNed). This saves us 2 bits.
In an index we only need 13 bits to store the complete size of the tuple, but
we use 15 bits in the iid, so we get two more bits of savings there.
That is sufficient for us to express the hint fields in an index. I think
Karl's way of expressing it requires only one bit, which looks very
efficient. So we can check the hint bits from the iid itself.

So with the addition of just 8 bytes per tuple, we can have the snapshot
stored with the index; a rough sketch of such a line pointer follows. Can
someone please comment on this?
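
To make (c) concrete, here is a rough sketch of what such an index line
pointer could look like. This is NOT the existing ItemIdData; the struct and
field names are made up for illustration, assuming MAXALIGN of 4 and index
tuple sizes that fit in 13 bits.

/*
 * Hypothetical index line pointer: length stored divided by 4 (everything
 * is MAXALIGNed) and the freed-up bits used as visibility hints.
 */
typedef struct IndexItemIdData
{
    unsigned    lp_off:15,      /* offset of the tuple within the page */
                lp_flags:2,     /* unused / normal / redirect / dead, as today */
                lp_len_div4:11, /* MAXALIGNed tuple length divided by 4 */
                lp_hints:4;     /* spare bits usable as visibility hint bits */
} IndexItemIdData;

#define IndexItemIdGetLength(iid)   ((iid)->lp_len_div4 << 2)
#define IndexItemIdGetOffset(iid)   ((iid)->lp_off)
#define IndexItemIdGetHints(iid)    ((iid)->lp_hints)

The layout still fits in the existing 32 bits (15 + 2 + 11 + 4).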

Thanks,
Gokul.


Re: [HACKERS] A thought on Index Organized Tables

2010-02-26 Thread Gokulakannan Somasundaram

 To be a bit more concrete: the typical sort of failure that you could
 get from broken btree operators is failure of transitivity, that is
 the comparators report A < B and B < C for some A, B, C, but do not say
 that A < C when those two values are compared directly.  I don't see any
 convenient way to detect that as a byproduct of normal index operations,
 because you wouldn't typically have a reason to make all three
 comparisons in close proximity.  Indeed, the searching and sorting
 algorithms do their best to avoid making redundant comparisons of that
 kind.


This is interesting, Tom, but I am unable to understand why it won't affect
the current indexes. During insertion the tuple might get inserted at some
block and offset, and while searching we might either get no results or be
pointed to a wrong place, because ordering is required for searching as well,
right? I definitely feel I am missing something here.

Gokul.


Re: [HACKERS] A thought on Index Organized Tables

2010-02-26 Thread Gokulakannan Somasundaram

 It definitely affects current indexes. We can't completely avoid bad user
 functions. That is why it is important to put limits on how much damage they
 can do. That's the motivation for the idea I mentioned before, of
 double-checking visibility data in an IndexTuple before letting it survive a
 VACUUM.


No, I don't say it would affect VACUUM, but I suspect it would
affect index-based selects. Since VACUUM uses a sequential scan of tuples, it
doesn't require the ordering operator, but any index-based search would
require an ordering operator for the binary search and for comparing with the
rightmost key.

Thanks,
Gokul.


Re: [HACKERS] A thought on Index Organized Tables

2010-02-26 Thread Gokulakannan Somasundaram


 Much better to take on a simple project like enabling full
 sequential index scans which you claimed you had a solution for and
 which is in any case an important sub-problem for IOT.


 Greg,
  Well, I don't think I am ready to take up a project of this size.
But at the same time, some important features are lacking in Postgres, and
someone should start working on them to make the database compete with other
databases effectively. So I would request people like Tom, Heikki, Simon and
you to take up a major project like this and provide the necessary impetus
to the adoption of the database.
  I have never written much code in C, and even if I write it, I am
sure I will receive the comment that it is unmaintainable code (e.g. the thick
index code and the trailing nulls code). So it is better that I start working
with one of you to get the hang of writing maintainable code. So I would
request one of you to initiate the development and provide the necessary
directions to me, if possible. That will save my development effort and your
reviewing effort.
At the same time, features like IOT and index-only scans are very necessary
for Postgres to at least make it into my company. So I am putting forward
all my arguments as a DB user.

Thanks,
Gokul.


Re: [HACKERS] A thought on Index Organized Tables

2010-02-26 Thread Gokulakannan Somasundaram
Missed the group..

On Sat, Feb 27, 2010 at 12:00 AM, Gokulakannan Somasundaram 
gokul...@gmail.com wrote:


 I definitely think thick indexes were too ambitious of a target for a
 first time patch. Sequential index scans is very ambitious itself
 despite being significantly simpler (if you have a solution which
 works -- we haven't had one thus far).


 The point I am trying to make is that I want to work with one of the
 senior people in the community on my first few patches.



 Can you point me to the thread on trailing nulls? I think trimming
 off any null columns from the ends of tuples when forming them should
 be a cheap and easy optimization which just nobody's gotten around to
 doing. If that's what you mean then I'm surprised you had any trouble
 getting buy-in for it.

 http://archives.postgresql.org/pgsql-hackers/2008-03/msg00682.php
 I think, the buy-in became difficult because of the code quality.


 Thanks,
 Gokul.




Re: [HACKERS] A thought on Index Organized Tables

2010-02-26 Thread Gokulakannan Somasundaram
 IIRC, what was being talked about was shoehorning some hint bits into
 the line pointers by assuming that size and offset are multiples of 4.
 I'm not thrilled with having mutable status bits there for reliability
 reasons, but it could be done without breaking a lot of existing code.
 What I was reacting to above was a suggestion that we could delete the
 itempointer size field altogether, which seems unworkable for the
 reasons I mentioned.


I think we can then pursue using an IndexTuple structure similar to
HeapTuple (as you suggested in an earlier message). This would involve (I
believe):
a) Making the current IndexTuple into an IndexTupleHeader
b) Creating a new structure called IndexTuple which stores the size and
has a pointer to the IndexTupleHeader (a rough sketch follows).
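
By analogy with HeapTupleData / HeapTupleHeaderData, something like the
sketch below is what I mean. All names are illustrative only, and the
simplified ItemPointerData here is just a stand-in for the real PostgreSQL
struct of the same name.

#include <stdint.h>

typedef struct ItemPointerData      /* simplified stand-in for the real one */
{
    uint16_t ip_blkid_hi;
    uint16_t ip_blkid_lo;
    uint16_t ip_posid;
} ItemPointerData;

typedef struct IndexTupleHeaderData /* what IndexTupleData is today, on page */
{
    ItemPointerData t_tid;          /* pointer to the heap tuple, as today */
    uint16_t        t_info;         /* flag bits; the size bits are freed up */
    /* insertion/deletion xids and hint bits could be added here */
} IndexTupleHeaderData;

typedef IndexTupleHeaderData *IndexTupleHeader;

typedef struct IndexTupleData       /* new in-memory handle, like HeapTupleData */
{
    uint32_t         t_len;         /* length of the on-page tuple, in bytes */
    IndexTupleHeader t_data;        /* pointer to the on-page tuple */
} IndexTupleData;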

But Tom, can you please explain to me why that broken-ordering example doesn't
affect the current index scans?

Thanks,
Gokul.


Re: [HACKERS] A thought on Index Organized Tables

2010-02-26 Thread Gokulakannan Somasundaram
 It does.  The point is that the system is set up to limit the bad
 consequences.  You might (will) get wrong query answers, but the
 heap data won't get corrupted.


Again Tom, if there is an update based on an index scan, then it takes the
tuple id and updates the wrong heap data, right?
The only difference with a thick index is that we reach back to
the same index tuple to update the snapshot. How will that corrupt the heap
data? Did you intend to say that it corrupts the index data?

Thanks,
Gokul.


Re: [HACKERS] A thought on Index Organized Tables

2010-02-26 Thread Gokulakannan Somasundaram
 It does.  The point is that the system is set up to limit the bad
 consequences.  You might (will) get wrong query answers, but the
 heap data won't get corrupted.


 Tom,
   If this is our goal - *can return wrong query answers, but
should not corrupt the heap data* - and if we make thick indexes capable of
that, can I consider that a thumbs up from your side? As you may already
know, this will only happen when there is a volatile-function-based index.

Heikki,
  Please let me know if you feel otherwise.

Thanks,
Gokul.


Re: [HACKERS] A thought on Index Organized Tables

2010-02-26 Thread Gokulakannan Somasundaram
 No, what generally happens is it fails to find a matching index entry at
 all, because the search algorithm concludes there can be no match based
 on the limited set of comparisons it's done.  Transitivity failures lead
 to searching the wrong subset of the index.


Actually Tom, I am not able to understand that completely. But what you are
saying is that in the current scenario, when there is an index on a broken
data type, it will return no results but will never return wrong results,
so the update will never corrupt the heap data. I take it as you say
(please correct me if I am wrong).
But even returning no results might lead to failures in unique checks. While
inserting, I try to check whether a particular value is already present, and
if the check returns no results, I go ahead and insert the data
assuming that the unique check has passed, while in reality it has failed.

Wait a minute.  Bingo!  So for unique checks we are already going to the
index from the heap. So it is the same thing I am doing with the thick index. So if
we can trust our current unique checks, then we should trust the thick
index.

Thanks, Tom, for this good conversation!!!

I think this broken data type / volatile function issue has to be
resolved for the current index, if we advocate stopping the thick index.
WOW!!!


Thanks,
Gokul.


Re: [HACKERS] A thought on Index Organized Tables

2010-02-26 Thread Gokulakannan Somasundaram
 Wait a minute.  Bingo!  So for unique checks we are already going to the
 index from the heap. So it is the same thing I am doing with the thick index. So if
 we can trust our current unique checks, then we should trust the thick
 index.

 Thanks, Tom, for this good conversation!!!

 I think this broken data type / volatile function issue has to be
 resolved for the current index, if we advocate stopping the thick index.
 WOW!!!


I think this opens up a lot of opportunities for improvement in Postgres.
a) HOT can now extend its reach beyond page boundaries.
b) If a heap has three indexes and an update affects only one of them,
then we need not update the other two indexes.

HOT can take a cleaner, fresher approach. If we have both the normal index
without a snapshot and the thick index, Postgres can boast a very rich index
family, with index structures for update/delete-intensive workloads (the
normal index) and the thick index for select-heavy workloads.

Marketing folks can easily advertise the product. :)

Gokul.


Re: [HACKERS] A thought on Index Organized Tables

2010-02-26 Thread Gokulakannan Somasundaram


 Actually Tom, I am not able to understand that completely. But what you are
 saying is that in the current scenario, when there is an index on a broken
 data type, it will return no results but will never return wrong results,
 so the update will never corrupt the heap data. I take it as you say
 (please correct me if I am wrong).
 But even returning no results might lead to failures in unique checks.
 While inserting, I try to check whether a particular value is already
 present, and if the check returns no results, I go ahead and insert the
 data assuming that the unique check has passed, while in reality it has
 failed.

 Wait a minute.  Bingo!  So for unique checks we are already going to the
 index from the heap. So it is the same thing I am doing with the thick index. So if
 we can trust our current unique checks, then we should trust the thick
 index.

 Thanks, Tom, for this good conversation!!!

 I think this broken data type / volatile function issue has to be
 resolved for the current index, if we advocate stopping the thick index.
 WOW!!!


 Can I get feedback from Tom / Heikki regarding my observation?

Regards,
Gokul.


Re: [HACKERS] Lock Wait Statistics (next commitfest)

2010-02-26 Thread Gokulakannan Somasundaram
I am just adding my two cents; please ignore it if it's totally irrelevant.
While we do performance testing/tuning of any application, the important
things, as a standard monitoring requirement from a database, are
a) The different types of wait events and the time spent in each of them
b) Top ten queries by Total Logical Reads & Average Logical Reads
c) Top ten queries by Total CPU Time & Average CPU Time

The monitoring methodology should not add so much overhead during the test
that it invalidates the application response times captured during the
performance test (let's not worry about Heisenberg uncertainty for now :)).

Of all the databases I have worked with, Oracle provides the best monitoring
product in the form of Statspack.

Statspack works the following way: it takes a copy of the important
catalog tables (the pg_* tables, in Postgres terms) which store information
such as wait statistics against wait events, cumulative i/o statistics
against each SQL_Hash (and SQL_Text), whether a particular plan went through
a hard parse / soft parse (because of plan caching), the status of different
in-memory data structures, etc.

So we take a snapshot like this before and after the test and generate a
Statspack report out of it, which contains all the necessary information for
database-level tuning. So we are never left in the dark from a database
tuning perspective.

Recently I wrote a set of SQL statements which do the same for SQL
Server from its sys tables, like wait_io_events and query_io_stats, and
finally retrieve the information in the same format as Statspack.

But I think we lack functionality like that in Postgres. I think things
like DTrace are more for developers than for users and, as already pointed
out, work only on Solaris. While we can expect that on Linux shortly, people
on Windows do not have many options. (I still maintain that DTrace is an
absolutely wonderful hooking mechanism.) So we should aim to develop a
monitoring mechanism like Statspack for Postgres.

Hope I have conveyed my concern.

Thanks,
Gokul.




On Sat, Feb 27, 2010 at 10:40 AM, Greg Smith g...@2ndquadrant.com wrote:

 Bruce Momjian wrote:

 What happened to this patch?



 Returned with feedback in October after receiving a lot of review, no
 updated version submitted since then:

 https://commitfest.postgresql.org/action/patch_view?id=98

 --
 Greg Smith  2ndQuadrant US  Baltimore, MD
 PostgreSQL Training, Services and Support
 g...@2ndquadrant.com   www.2ndQuadrant.us






Re: [HACKERS] Testing of parallel restore with current snapshot

2010-02-26 Thread Gokulakannan Somasundaram
Tom,
I just took the patch, but it seems to be in binary format. Can you send
the patch to me?

Thanks,
Gokul.

On Sat, May 30, 2009 at 3:12 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 Josh Berkus j...@agliodbs.com writes:
  Tom,
  Is anyone interested enough to try it if I code it?

  If you're patient for results, sure.  I seem to be doing a customer
  migration or upgrade every week now, so it wouldn't take me long to have
  a test subject with a fairly complex database.

 Here's a draft patch that does ordering using two lists, as I proposed.
 Please test to see if it's any faster or slower than the original logic.

 Note: since this changes struct TocEntry, be sure to recompile all files
 in src/bin/pg_dump/ after patching.

regards, tom lane







Re: [HACKERS] A thought on Index Organized Tables

2010-02-25 Thread Gokulakannan Somasundaram
 Yes. When a bit is cleared, that's OK, because a cleared bit just means
 you need to check visibility in the heap tuple. When a bit is set,
 however, it's important that it doesn't hit the disk before the
 corresponding heap page update. That's why visibilitymap_set() sets the
 LSN on the page.

 OK. Say a session doing an update, which is the first update on the page,
resets PD_ALL_VISIBLE and crashes just before updating the visibility map.
The subsequent inserts/updates/deletes will see the PD_ALL_VISIBLE
flag cleared and never care to update the visibility map, but tuples would
actually have been created in the index and the table. So won't this return
wrong results?

Again, it is not clear from your documentation how you have handled this
situation.

Thanks,
Gokul.


Re: [HACKERS] A thought on Index Organized Tables

2010-02-25 Thread Gokulakannan Somasundaram
 The replay of the heap insert/update/delete record updates the
 visibility map.

 So are you planning to make the section that writes the xlog and updates
the visibility map a PANIC (critical) section?


Re: [HACKERS] A thought on Index Organized Tables

2010-02-25 Thread Gokulakannan Somasundaram
 The replay of the heap insert/update/delete record updates the
 visibility map.


Say a checkpoint has occurred in between and flushed the dirty pages to
disk while the updater waits to update the visibility map. Now there will
be no replay for the insert/update/delete, right?


Re: [HACKERS] A thought on Index Organized Tables

2010-02-25 Thread Gokulakannan Somasundaram
 1) transaction information in index

This seems like a lot of bloat in indexes. It also means breaking
 a lot of other optimizations such as being able to read the tuples
 directly from the heap page without locking. I'm not sure how much
 those are worth though. But adding 24 bytes to every index entry seems
 pretty unlikely to be a win anyways.


Greg,
  I think things have been misunderstood somewhere: we only need 8
bytes more per index entry. I thought Postgres had an 8-byte transaction id,
but it is only 4 bytes, and we only need to save the insertion and deletion
xids. So 8 bytes more per tuple (a minimal sketch is below).
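
Just to make the arithmetic explicit, the extra per-entry payload would be
only this (the struct name is made up for illustration; xids are 4 bytes
each):

#include <stdint.h>

typedef uint32_t TransactionId;         /* xids are 32 bits in Postgres */

typedef struct IndexVisibilityInfo      /* hypothetical name */
{
    TransactionId xmin;                 /* xid that inserted the entry */
    TransactionId xmax;                 /* xid that deleted it, if any */
} IndexVisibilityInfo;                  /* sizeof() == 8 bytes */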

Gokul.


Re: [HACKERS] A thought on Index Organized Tables

2010-02-25 Thread Gokulakannan Somasundaram
 Wait a second, which idea are we currently talking about?  No heap at
 all, or just the ability to check visibility without visiting the heap?


I was talking about the indexes with snapshot



 If it's a genuine IOT (ie no separate heap), then you are not going to
 be able to get away without a full heap tuple header.  We've sweated
 blood to get that struct down to where it is; there's no way to make it
 smaller without giving up some really fundamental things, for example
 the ability to do UPDATE :-(


Of course, as I said, the leaf pages will have HeapTuples in an IOT. As a
Postgres user, I am definitely thankful for what has already been done.


 If you just want to avoid a heap visit for visibility checks, I think
 you'd only need to add xmin/xmax/cmin plus the hint bits for same.
 This is going to end up costing 16 bytes in practice --- you might
 think you could squeeze into 12 but on 64-bit machines (MAXALIGN 8)
 you'll save nothing.  So that's effectively a doubling of index size
 for common cases such as a single int4 or int8 index column.


Yes, but currently we store the size of the index tuple in the IndexTuple, and
it is also stored in the ItemId. If we can somehow make the code use that info,
then we get 13 bits of flags for free and can reduce this to 8 bytes of extra
info. But we need you to sweat some more blood for that :). Then again, unless
we resolve the volatile functions issue, there is no use worrying about
this.


 The other
 problem is the extra write load created by needing to update the index's
 copies of the hint bits; not to mention extra writes to freeze the xids
 when they get old enough.

But Tom, I remember that vacuum was faster when the index had visibility
info, since we need not touch the table. But maybe I am wrong; at least I
remember that was the case when the relation had only thick indexes.
Oh, yeah... the visibility map might have changed the equation.


Thanks,
Gokul


Re: [HACKERS] A thought on Index Organized Tables

2010-02-25 Thread Gokulakannan Somasundaram

 I disagree with that, Gokul -- if the ordering operators are volatile or
 just incorrect, during DELETE, you could set xmax in the wrong IndexTuple.
 Then there will be another IndexTuple that says it's visible, but it points
 to a non-visible heap tuple. I think you should follow the pointers to the
 heap before you decide to let an index tuple remain in the index during
 vacuum. This would ensure that all references from an index to a heap tuple
 are removed before vacuuming the heap tuple. I would be worried about what
 might break if this invariant doesn't hold.


Well, Karl, if we have to support function-based indexes / IOT, one thing is
for sure: we can't support them for volatile functions / broken data types.
Everyone agrees with that. But the question is how we identify that something
is not a volatile function. The only way currently is to let the user make the
decision (or we should consult some mathematician). So we need not consult
the heap tuple.

Gokul.


Re: [HACKERS] A thought on Index Organized Tables

2010-02-25 Thread Gokulakannan Somasundaram
Tom,

Actually, if you need to squeeze a few more bits into that word, the
 thing to do would be to get rid of storing the tuple length there.
 This would involve adding the same type of indirection header that
 we use for HeapTuples, so that the length would be available at need
 without going back to the item pointer.  I


I feel the other approach is easier: store the hint bits inside the ItemId, in
place of the size. We have 16 bits there. Whenever the size is required, we
follow the offset to the corresponding tuple and take the
size from there. The change seems minimal, but please bear with me if
I am being very ignorant about something.



   Squeezing cmin in there is just fantasy though.


I think we can get away with this by making anyone who inserts and
selects in the same transaction go and find the visibility through the heap.
In the index tuple hint bits, we can note whether the command is a simple
insert/update/delete. By a simple insert, I mean one that doesn't involve a
select. So in that case, the tuple can be made visible to statements within
the same transaction. We can even document that people can just insert a
savepoint between their insert and select. This would increase the xid and
make that tuple visible within the same transaction. All that seems
possible.

Thanks,
Gokul.


Re: [HACKERS] A thought on Index Organized Tables

2010-02-25 Thread Gokulakannan Somasundaram


 First of all, volatility is not the only issue. The ordering ops could also
 be incorrect, e.g., violate the transitivity property. there is no reliable
 way to determine if a function is volatile and/or incorrectly specified.


No, it is the only issue. If you create a data type with a volatile function
for its ordering ops, then you have the broken data type (the one you are
referring to). So they are one and the same.

Thanks,
Gokul.


Re: [HACKERS] A thought on Index Organized Tables

2010-02-25 Thread Gokulakannan Somasundaram
 No, it's far from harmless.  As soon as that heap TID gets filled with
 an unrelated tuple, you run the risk of indexscans alighting on and
 perhaps modifying the wrong tuple.


Tom,
  In function-based indexes on functions which we suspect to be
volatile, or on data types which we suspect to be broken, can we have
additional checks to ensure that this does not happen? I mean, do you think
that would solve the issue?

Thanks,
Gokul.


Re: [HACKERS] A thought on Index Organized Tables

2010-02-25 Thread Gokulakannan Somasundaram
On Fri, Feb 26, 2010 at 9:54 AM, Gokulakannan Somasundaram 
gokul...@gmail.com wrote:


 No, it's far from harmless.  As soon as that heap TID gets filled with
 an unrelated tuple, you run the risk of indexscans alighting on and
 perhaps modifying the wrong tuple.


 Tom,
 I think this will never happen. The only issue is when we need to go
back to the index from the heap; this is to update the timestamps of the
update/delete.

Gokul.


Re: [HACKERS] A thought on Index Organized Tables

2010-02-25 Thread Gokulakannan Somasundaram
 No, we're not going there.  That breaks the fundamental page content
 manipulation algorithms, and falls down for tuples not yet stored in a
 page (or being examined without a pointer to the page readily at hand),
 and has no redeeming social value anyway compared to doing it in the
 proven fashion.


Tom,
I was also concerned about that, but just thought of informing
you about the option. But I think it will never fall down for tuples not
stored in a page, as we have the offset and the hint bits to indicate
whether a tuple is there or not. Only the two-byte size field would move down
under my suggestion. But your intuition has the highest probability of success.
My concern was that it would make a heap page different from
a b-tree index page.


Re: [HACKERS] A thought on Index Organized Tables

2010-02-25 Thread Gokulakannan Somasundaram


 Proving that a set of comparison operators are consistent just by
 examining their runtime behavior is probably equivalent to solving the
 halting problem.  I can't see us doing it, or wanting to accept the
 overhead of checking it even if it could be done.


The overhead of checking is minimal. When we update, we just have to
carry the tuple id of the heap tuple and its insert transaction id, and check
whether they match what is in the index snapshot. If they do not match, then we
go ahead and start treating this index as either dropped or as a normal
index (without a snapshot). Since the overhead of dropping / marking it as a
normal index will occur very rarely, we need not be concerned about that
performance impact (I suppose). The overhead of checking will be
there only for suspicious user-defined functions. (We can have a flag,
is_suspicious.)



 To be a bit more concrete: the typical sort of failure that you could
 get from broken btree operators is failure of transitivity, that is
 the comparators report A < B and B < C for some A, B, C, but do not say
 that A < C when those two values are compared directly.  I don't see any
 convenient way to detect that as a byproduct of normal index operations,
 because you wouldn't typically have a reason to make all three
 comparisons in close proximity.  Indeed, the searching and sorting
 algorithms do their best to avoid making redundant comparisons of that
 kind.

 I am not saying that we should analyze runtime behavior. I am saying
that we would provide a set of built-in functions which are always
stable (with some flag in pg_proc). When a function is created, we scan it
for any calls to functions that are not in the provided stable set.
If it contains one such call, then it is declared suspect of being
broken/volatile (a toy example of such a broken comparator follows).
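
For illustration only, here is a made-up comparator of the kind being
discussed: every individual comparison looks sane, but transitivity fails,
so a btree search can be steered into the wrong subset of the index. This is
not taken from any real data type.

#include <stdio.h>

/*
 * "Broken" ordering: values live on a circle of size 6 and each value is
 * considered smaller than the next two values clockwise.  Locally every
 * pair compares fine, but transitivity fails: 0 < 2 and 2 < 4, yet 4 < 0.
 */
static int
circular_cmp(int a, int b)
{
    int d;

    if (a == b)
        return 0;
    d = ((b - a) % 6 + 6) % 6;          /* clockwise distance from a to b */
    return (d <= 2) ? -1 : 1;           /* a "<" b if b is within two steps */
}

int
main(void)
{
    printf("cmp(0,2)=%d cmp(2,4)=%d cmp(0,4)=%d\n",
           circular_cmp(0, 2), circular_cmp(2, 4), circular_cmp(0, 4));
    /* prints -1 -1 1: A < B and B < C, but not A < C */
    return 0;
}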

Thanks for the reply,
Gokul.


Re: [HACKERS] A thought on Index Organized Tables

2010-02-24 Thread Gokulakannan Somasundaram


 I think Gokul was asking because he wanted to work on it, but wanted to
 check community approval first.


Yes, the problem is that we need to come to a consensus on broken data types:
as Heikki pointed out, those data types which are based on an unstable
function like time, date, random, etc. This is definitely a theoretical
possibility, but we still want to continue building indexes which support
these features. If we can take a decision regarding this, we can have a
feature like IOT.




  There's  many tricks like column-oriented storage, compression,
  index-organised-tables etc. that would be nice to have. Whether any
  particular feature is worthwhile in the end, the devil is in the details.

 Please consider my following statements from a database tuner's perspective.
I don't want to discourage the visibility map feature, but it has the
disadvantages which we already discussed. I might do an explain analyze and
see 300 reads, but the same query in production might lead to 400
reads (with all the extra 100 being random i/os) because of the state of the
visibility map. If there is a long batch job running somewhere in the
database, it will affect almost all of the relation's visibility map. So
how can a person tune and test a query in dev, put it in production, and
be confident about the i/o performance?  Say the visibility map goes into core
after 9.x; the performance of those databases will be worse compared to the
previous release in these circumstances.

All I am trying to say is that there are cases where the visibility map will
be ineffective; are we deciding to find solutions for those cases?
Thanks,
Gokul.


Re: [HACKERS] A thought on Index Organized Tables

2010-02-24 Thread Gokulakannan Somasundaram


 The fragility there is not an issue in a mostly read-only application,
 but it definitely would be a concern in other cases.


While we accept that the visibility map is good for read-only applications,
why can't we make it optional? At least if there is a way for a person to drop
the visibility map for a table (if it gets created by default), the
application need not incur the overhead for those tables when it knows they
are update-intensive or subject to batch jobs.

Again, not to deviate from my initial question: can we make a decision
regarding unstable/mutable functions / broken data types?

Thanks,
Gokul.


Re: [HACKERS] A thought on Index Organized Tables

2010-02-24 Thread Gokulakannan Somasundaram
Forgot to include the group..

On Wed, Feb 24, 2010 at 5:38 PM, Gokulakannan Somasundaram 
gokul...@gmail.com wrote:

 I am not familiar with this term broken data types, and I just looked for
 it in the source code and couldn't find it.

 What exactly are you referring to?

 cheers

 andrew


 Sorry, I missed this. Actually, if we create a function A which uses
 functions like time(), date() and random(), then function A won't give
 the same output even if we give it the same input. So if a person has created
 a data type which uses such functions, then it can't be made the primary
 key of an index-organized table, because I need to reach the same tuple by
 applying the function to the supplied values. But since the function is
 mutable, we can't reach the same tuple.

 If we decide to support only data types containing immutable functions, then
 there might be people who have created these kinds of functions and marked
 them as immutable (while they are actually mutable). Those functions will
 result in index corruption / failed operations. Only if we resolve this issue
 can we have data structures like IOT.

 Hope I was clear.

 Thanks,
 Gokul.




Re: [HACKERS] A thought on Index Organized Tables

2010-02-24 Thread Gokulakannan Somasundaram


 If you have a scenario where the visibility map incurs a measurable
 overhead, let's hear it. I didn't see any in the tests I performed, but
 it's certainly possible that if the circumstances are just right it
 makes a difference.

 Heikki,
  The obvious one I could observe is that it would increase WAL
contention. Am I missing something?  All I am suggesting is reducing the
unnecessary work required for those tables where the visibility map is not
required. For example, in data warehouses people might even have tables
without any indexes. Why do we ask them to incur the overhead of the
visibility map?
  Also, since you have made the visibility map work without any page-level
locking, have you considered whether it ensures the correct
order of inserts into the WAL? I have looked at some random threads, but I
couldn't find the complete design of the visibility map as used for
index-only scans.

Thanks,
Gokul.


Re: [HACKERS] A thought on Index Organized Tables

2010-02-24 Thread Gokulakannan Somasundaram


 I think you're a barking up the wrong tree.  AFAIUI, the need for the
 visibility map has not very much to do with whether the table has
 indices, and everything to do with avoiding unnecessary VACUUMs.  In
 any event, you've not shown that the visibility map HAS any overhead,
 so talking about skipping it seems entirely premature.  Keep in mind
 that the visibility map is quite small.


OK! I am not saying we should remove the visibility map, in case I was
misunderstood. All I am saying here is to remove the index-only-scan
processing of the visibility map. If it is being used only for vacuums, you
need not make it crash-safe, and no WAL comes into the picture.



 The point of the visibility map as far as index-only scans are
 concerned is that if all the needed column values can be extracted
 from the index, we still need to read the heap page to check tuple
 visibility - unless, of course, we already know from the visibility
 map that all the tuples on that heap page are guaranteed to be visible
 to all transactions.  On a read-only or read-mostly table, this will
 reduce the cost of checking tuple visibility by several orders of
 magnitude.

 I understand that. As I suggested above, if you have no indexes on a
table, why do you need to spend the extra effort making it crash-safe for
that table? Hope I am clear.

Thanks,
Gokul.


Re: [HACKERS] A thought on Index Organized Tables

2010-02-24 Thread Gokulakannan Somasundaram

  But adding 24 bytes to every index entry seems
 pretty unlikely to be a win anyways.


We actually wanted to make it optional; not every index would be like that.
More than that, we can make it 16 bytes. Only commands within the same
transaction would not be able to do an index-only scan.


This seems like a non-starter to me. We would lose the option of
 doing sequential scans and the ability to have any other indexes on
 the table. That would be comparable to Oracle circa 1985. We can do
 better with stuff like Heikki's grouped index tuple and the
 visibility map which don't interfere with other features as much.


Sequential scans can be done on IOTs; just scan through the leaf pages. I
think you are talking about IOTs with overflow regions.
As I said already, this serves a different set of options for the DB
designer.




 I don't think these three are actually related. Afaict neither IOT nor
 visibility information in indexes depend on refinding keys in the
 index. But it's possible I'm missing something. Even so they're still
 three separate issues.

 If we have visibility information in the heap, we need to go to the same
index tuple whenever there is an update/delete. Now if there is a broken
function, it won't let us reach the index from the heap tuple. Hope you are
able to get it.

Thanks,
Gokul.


Re: [HACKERS] A thought on Index Organized Tables

2010-02-24 Thread Gokulakannan Somasundaram


 With an IOT I don't understand how you get out of index corruption
 without data loss.  That's a showstopper for practical use, I think.


For simplicity, say we are storing all the non-leaf pages of the index in a
separate file; then the leaf pages are nothing but the table. So if we can
replicate the table, then we can replicate the non-leaf pages (say by some
modified version of REINDEX).

Thanks,
Gokul.


Re: [HACKERS] A thought on Index Organized Tables

2010-02-24 Thread Gokulakannan Somasundaram
On Wed, Feb 24, 2010 at 10:09 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Kevin Grittner kevin.gritt...@wicourts.gov writes:
  So you are essentially proposing that rather than moving the heap
  data into the leaf tuples of the index in the index file, you will
  move the leaf index data into the heap tuples?  The pages in such a
  IOT heap file would still need to look a lot like index pages, yes?

  I'm not saying it's a bad idea, but I'm curious what benefits you
  see to taking that approach.

 Isn't that just a variant on Heikki's grouped index tuples idea?

regards, tom lane


No, Tom. Grouped index tuples don't use the B+-tree data structure to
achieve the sorting, so they will not guarantee 100% clustering of the data.

Gokul.


Re: [HACKERS] A thought on Index Organized Tables

2010-02-24 Thread Gokulakannan Somasundaram
 Yes. The visibility map doesn't need any new WAL records to be written.

 We probably will need to add some WAL logging to close the holes with
 crash recovery, required for relying on it for index-only-scans, but
 AFAICS only for VACUUM and probably only one WAL record for a whole
 bunch of heap pages, so it should be pretty insignificant.


Hmmm... So whenever an update transaction changes a page, it will update
the visibility map but won't write a WAL record.
So after a crash we have a visibility map which has false positives.
Isn't that wrong?



 Let me repeat myself: if you think the overhead of a visibility map is
 noticeable or meaningful in any scenario, the onus is on you to show
 what that scenario is. I am not aware of such a scenario, which doesn't
 mean that it doesn't exist, of course, but hand-waving is not helpful.


Well, as a DB tuner, I am requesting that it be made an optional feature. If
you and everyone else feel convinced, please consider my request.




 I'm not sure what you mean with without any page level locking.
 Whenever a visibility map page is read or modified, a lock is taken on
 the buffer.


OK. I thought you were following some kind of lock-less algorithm there.
Then updaters/deleters of multiple pages might be waiting on the same lock,
and hence there is a chance of contention there, right?  Again, correct me
if I am wrong (I might have understood things incorrectly).

Thanks,
Gokul.


Re: [HACKERS] A thought on Index Organized Tables

2010-02-24 Thread Gokulakannan Somasundaram
Missed the group...

On Thu, Feb 25, 2010 at 12:34 AM, Gokulakannan Somasundaram 
gokul...@gmail.com wrote:



 On Thu, Feb 25, 2010 at 12:28 AM, Gokulakannan Somasundaram 
 gokul...@gmail.com wrote:


 That doesn't work because when you split an index page any sequential
 scan in progress will either see the same tuples twice or will miss
 some tuples depending on where the new page is allocated. Vacuum has a
 clever trick for solving this but it doesn't work for arbitrarily many
 concurrent scans.

 Consider how the range scans are working today, while the page split
 happens.

 The Seq scan should follow the right sibling to do the seq scan.

 Gokul.


 Actually, thinking about what you suggested for a while, I think it should
 be possible, because the Oracle fast full index scan essentially scans the
 index like that. I will try to think of a way of doing that with Lehman and
 Yao...

 Gokul.



Fwd: [HACKERS] A thought on Index Organized Tables

2010-02-24 Thread Gokulakannan Somasundaram
Missed the group



 So basically you want to have index-only scans, but you want them to
 be really slow?


No, Robert, I just asked him to make it optional, so that the crash-safe
visibility map is used only when the DB designer sees a need for it.
When there is a table with no indexes, a table whose data will be updated
often, or a database where a batch job runs alongside OLTP work, the
visibility map is just useless. So I am requesting that Heikki make it
optional. I am not asking him to drop the idea, as it might be useful in
read-only environments.

Hope I have not offended you in conveying that.

Thanks,
Gokul.


Re: [HACKERS] A thought on Index Organized Tables

2010-02-24 Thread Gokulakannan Somasundaram
 It might be slightly easier given the assumption that you only want to
 scan leaf tuples.

 But there's an additional problem I didn't think of before. Currently
 we optimize index scans by copying all relevant tuples to local memory
 so we don't need to hold an index lock for an extended time or spend a
 lot of time relocking and rechecking the index for changes. That
 wouldn't be possible if we needed to get visibility info from the page
 since we would need up-to-date information.


 We should solve this issue in the same way we handle index-only quals in
the current index-only scan approach.

Gokul.


Re: [HACKERS] A thought on Index Organized Tables

2010-02-24 Thread Gokulakannan Somasundaram

 That doesn't work because when you split an index page any sequential
 scan in progress will either see the same tuples twice or will miss
 some tuples depending on where the new page is allocated. Vacuum has a
 clever trick for solving this but it doesn't work for arbitrarily many
 concurrent scans.

 Consider how the range scans are working today, while the page split
 happens.

 The Seq scan should follow the right sibling to do the seq scan.

 Gokul.


 Actually thinking about what you suggested for a while, i think it should
 be possible, because the Oracle Fast Full Index scan essentially scans the
 index like that. I will try to think a way of doing that with Lehman and
 Yao...

 Gokul.


OK, I think I can see a solution to achieve a fast full index scan like
Oracle's (a toy simulation follows below).
a) Issue ids to every block that gets created inside the index; we are
already doing that.
b) Before the fast full index scan starts, note down the max id that has
been issued.
c) Do a scan similar to a full table scan up to that max id. While we
are scanning the blocks, note down the right siblings in a list if the
right-sibling block id is greater than the max id that was noted. These are
the blocks which were split off after the scan started.
d) After we reach that max block, range-scan the missing links
until we hit the end or get a right sibling whose id is less than the max
block id noted.

Once we do this for all the missing links, we have scanned through the
entire chain. So this will definitely be faster than a range scan.
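
Here is a small, self-contained toy simulation of that scan order (steps
a-d). The page array, creation ids and sibling links are invented for the
example; they are not PostgreSQL structures.

#include <stdio.h>

#define NPAGES 8
#define NONE   (-1)

typedef struct
{
    int creation_id;    /* step (a): id issued when the block was created */
    int right_sibling;  /* physical index of the right sibling leaf, or NONE */
    int first_key;      /* pretend payload */
} ToyLeaf;

static void
visit(const ToyLeaf *leaf, int idx)
{
    printf("visit leaf %d (keys starting at %d)\n", idx, leaf->first_key);
}

int
main(void)
{
    /* Leaves 0..5 existed when the scan started; 6 and 7 were split off later. */
    ToyLeaf leaves[NPAGES] = {
        {0, 1, 10}, {1, 6, 20}, {2, 3, 40}, {3, 4, 50}, {4, 7, 60}, {5, NONE, 90},
        {6, 2, 30},     /* created by splitting leaf 1 after the scan started */
        {7, 5, 70},     /* created by splitting leaf 4 after the scan started */
    };
    int max_id = 5;                 /* step (b): highest id issued at scan start */
    int missing[NPAGES];
    int nmissing = 0;
    int i, m;

    /* step (c): physical-order pass over the blocks that existed at scan start */
    for (i = 0; i < NPAGES && leaves[i].creation_id <= max_id; i++)
    {
        int rs = leaves[i].right_sibling;

        visit(&leaves[i], i);
        if (rs != NONE && leaves[rs].creation_id > max_id)
            missing[nmissing++] = rs;   /* this neighbour was split off later */
    }

    /* step (d): chase each remembered link until we reach an old block again */
    for (m = 0; m < nmissing; m++)
    {
        i = missing[m];
        while (i != NONE && leaves[i].creation_id > max_id)
        {
            visit(&leaves[i], i);
            i = leaves[i].right_sibling;
        }
    }
    return 0;
}

In the example, leaves 6 and 7 were split off after the scan started, and
they are picked up exactly once in step (d), so every leaf is visited once.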

Thanks to you, I have thought about an important issue and also a solution
for it.

Thanks,
Gokul.


Re: [HACKERS] A thought on Index Organized Tables

2010-02-24 Thread Gokulakannan Somasundaram
 You also need to avoid scanning the same tuple twice. That's not a
 problem for VACUUM, but it is for full index scans.

 Heikki,
  Actually, the logic which I explained earlier is precisely to avoid
scanning tuples twice.

Gokul.


Re: [HACKERS] A thought on Index Organized Tables

2010-02-24 Thread Gokulakannan Somasundaram
 I haven't thought about whether this is sufficient but if it is then
 an initial useful thing to add would be to use it for queries where we
 have a qual that can be checked using the index key even though we
 can't do a range scan. For example if you have a btree index on
 a,b,c and you have a WHERE clause like WHERE c=0

 That would be a much smaller change than IOT but it would still be a
 pretty big project. Usually the hardest part is actually putting the
 logic in the planner to determine whether it's advantageous. I would
 suggest waiting until after 9.0 is out the door to make sure you have
 the attention of Heikki or Tom or someone else who can spend the time
 to check that it will actually work before putting lots of work coding
 it.

 I will try that. Thanks ...


Re: [HACKERS] A thought on Index Organized Tables

2010-02-24 Thread Gokulakannan Somasundaram
 The WAL record of the heap insert/update/delete contains a flag
 indicating that the visibility map needs to be updated too. Thus no need
 for a separate WAL record.


Heikki,
Have you considered these cases?
a) The current WAL architecture makes sure that the WAL record is written
before the actual page flush (I believe). But you are changing that
architecture for the visibility map: a visibility map page might get flushed
out before the corresponding WAL gets written. I think you would then suggest
full-page writes here.
b) Say that for a large table you have multiple buffers of the visibility map;
then there is a chance that one buffer gets flushed to disk and another
doesn't. If the WAL records are not in place, then this leads to a
time-inconsistent visibility map.
c) If you are going to track all the WAL linked with a buffer of the
visibility map, then you need to introduce another synchronization point in
the critical path.

Maybe I am missing something? I am asking these questions only out of
curiosity.

Thanks,
Gokul.


Re: [HACKERS] A thought on Index Organized Tables

2010-02-24 Thread Gokulakannan Somasundaram
On Thu, Feb 25, 2010 at 3:16 AM, Gokulakannan Somasundaram 
gokul...@gmail.com wrote:


 I haven't thought about whether this is sufficient but if it is then
 an initial useful thing to add would be to use it for queries where we
 have a qual that can be checked using the index key even though we
 can't do a range scan. For example if you have a btree index on
 a,b,c and you have a WHERE clause like WHERE c=0

 That would be a much smaller change than IOT but it would still be a
 pretty big project. Usually the hardest part is actually putting the
 logic in the planner to determine whether it's advantageous. I would
 suggest waiting until after 9.0 is out the door to make sure you have
 the attention of Heikki or Tom or someone else who can spend the time
 to check that it will actually work before putting lots of work coding
 it.

 I will try that. Thanks ...


Some more ideas popped up; I am just recording them.
a) In place of a block id (this would have to be issued for every new/recycled
block, and it is not there in Postgres today), we could even use SnapshotNow's
transaction id. I just feel the synchronization cost will be higher here.
b) We could just record the current timestamp in the page. While this needs
no synchronization, it might create problems when we decide to go for
master-master replication and distributed databases; when such things
happen, the clocks on the various systems have to be synchronized.

Gokul.


Re: [HACKERS] A thought on Index Organized Tables

2010-02-23 Thread Gokulakannan Somasundaram
On Tue, Feb 23, 2010 at 10:42 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 Takahiro Itagaki itagaki.takah...@oss.ntt.co.jp writes:
  Instead, how about excluding columns in primary keys from table data?

 How will you implement select * from mytable ?  Or even
 select * from mytable where non_primary_key = something ?
 If you can't do either of those without great expense, I think
 a savings on primary-key lookups is not going to be adequate
 recompense.


Tom,
  I am talking more from the perspective of how things have been
implemented in Oracle / SQL Server. Both Oracle and SQL Server store the
snapshot info with indexes and hence can do index-only scans with their
indexes. But they still have the concept of Index Organized Tables /
Clustered Indexes. Apart from the disk footprint, it also has an impact on
cache efficiency.
   In Oracle IOTs and SQL Server Clustered Indexes, you have the option to
store some of the columns in the leaf pages (but not in the non-leaf pages);
the tuples won't get sorted based on them, but you don't require
an extra i/o to access them. This optimization is again to reduce the size
of the IOT. Oracle IOT has a concept called overflow regions, which is more
like a heap and stores a few columns. There is a pointer from the main
b-tree structure to this secondary structure. Accessing these columns is
costly, but the idea is that the database designer has taken this into
account while deciding on the columns to be put in the overflow regions.
   We can design secondary indexes to make access faster for
non-primary-key searches. But since the secondary indexes store the
primary key in place of the heap tuple pointer, the access will usually take
2-3 more i/os. The idea is that the IOT is for the kind of data which
will be 99% queried based on primary keys; the database provides that extra
performance for that kind of access pattern. So to answer your question,
full table scans (if overflow regions are involved) and searches based on
non-primary keys will be slow in an IOT.
 I looked at the Postgres nbtree code. From my analysis (which might
be wrong!), we can implement IOTs, provided we make a decision on the broken
data types issue.

Thanks,
Gokul.


Re: [HACKERS] A thought on Index Organized Tables

2010-02-22 Thread Gokulakannan Somasundaram
Heikki,
 I had a quick look at the discussion on the visibility map design. The
main differences, as I see them, are:
a) An IOT has both table and index in one structure, so there is no
duplication of data.
b) With visibility maps, we have three structures: a) table, b) index, c)
visibility map. So the disk footprint of the same data will be higher in
Postgres (2x + the size of the visibility map).
c) More than that, inserts and updates will incur two extra random i/os
every time: one for updating the table and one for updating the visibility
map.

Are we going to ignore these differences?

Thanks,
Gokul.

On Mon, Feb 22, 2010 at 12:21 PM, Heikki Linnakangas 
heikki.linnakan...@enterprisedb.com wrote:

 Gokulakannan Somasundaram wrote:
  Hi,
  As you all know, Index Organized tables are a way by which we can
  automatically cluster the data based on the primary key. While i was
  thinking about an implementation for postgres, it looks like an
 impossible
  with the current ideologies. In an IOT, if a record gets updated, we need
 to
  mark the old row as deleted immediately, as we do with any other table.
 But
  since Postgres supports user defined data types and if they happen to be
 a
  broken data type, then we have an unstable IOT.(as there is no guarantee,
 we
  might hit the same record)
  This was the reason for which, the proposal on creating  indexes with
  snapshot was rejected.
  May i get a little clarification on this issue? Will we be supporting
  the IOT feature in postgres in future?

 What seems like the best path to achieve the kind of performance
 benefits that IOTs offer is allowing index-only-scans using the
 visibility map. I worked on that last summer, but unfortunately didn't
 have the time to finish anything.

 --
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] A thought on Index Organized Tables

2010-02-22 Thread Gokulakannan Somasundaram
Forgot to include the group...

On Mon, Feb 22, 2010 at 4:31 PM, Gokulakannan Somasundaram 
gokul...@gmail.com wrote:



 These sound like the same point to me. I don't think we're concerned
 with footprint -- only with how much of that footprint actually needs
 to be scanned. So if we have a solution allowing the scan to only need
 to look at the index then the extra footprint of the table doesn't
 cost anything at run-time. And the visibility map is very small.


 Yep.. They are one and the same...
 Just wanted a clarification on the design goals going forward.

 Thanks,
 Gokul.




[HACKERS] A thought on Index Organized Tables

2010-02-21 Thread Gokulakannan Somasundaram
Hi,
As you all know, index-organized tables are a way by which we can
automatically cluster the data based on the primary key. While I was
thinking about an implementation for Postgres, it looks impossible
with the current ideology. In an IOT, if a record gets updated, we need to
mark the old row as deleted immediately, as we do with any other table. But
Postgres supports user-defined data types, and if one happens to be a
broken data type, then we have an unstable IOT (as there is no guarantee
that we will hit the same record).
This was the reason the proposal on creating indexes with
snapshots was rejected.
May I get a little clarification on this issue? Will we be supporting
the IOT feature in Postgres in the future?

Thanks,
Gokul.


[HACKERS] Precedence of Postfix operators

2010-02-07 Thread Gokulakannan Somasundaram
Hi,
   Is there any reason why we give postfix operators lower precedence than
multiplication/division? Usually postfix operators have higher precedence
than binary operations. Is this some kind of workaround to support
user-defined operators? Can someone help me understand this?

Thanks,
Gokul.

