Re: [HACKERS] GSoC on WAL-logging hash indexes

2014-03-03 Thread Tan Tran
Thanks for alerting me to your previous idea. While I don't know enough
about Postgresql internals to judge its merits yet, I'll write some
pseudocode based on it in my proposal; and I'll relegate it to a reach
proposal alongside a more straightforward one.

Tan

On Mon, Mar 3, 2014 at 8:12 AM, Robert Haas robertmh...@gmail.com wrote:

 On Sun, Mar 2, 2014 at 2:38 PM, Tan Tran tankimt...@gmail.com wrote:
  2. Proposal
  As a GSoC student, I will implement WAL recovery of hash indexes using
 the
  other index types' WAL code as a guide. Roughly, I will:
  - Devise a way to store and retrieve hashing data within the XLog data
  structures.
  - In the existing skeleton for hash_redo(XLogRecPtr lsn, XLogRecord
 *record)
  in hash.c, branch to code for the various redo operations: creating an
  index, inserting into an index, deleting an index, and page operations
  (split, delete, update?).
  - Code each branch by drawing on examples from btree_redo, gin_redo, and
  gist_redo, the existing XLog code of the other index types.

 Unfortunately, I don't believe that it's possible to do this easily
 today because of the way bucket splits are handled.  I wrote about
 this previously here, with an idea for solving the problem:


 http://www.postgresql.org/message-id/ca+tgmozymojsrfxhxq06g8jhjxqcskvdihb_8z_7nc7hj7i...@mail.gmail.com

 Sadly, no one responded.  :-(

 --
 Robert Haas
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company



Re: [HACKERS] GSoC 2014 - mentors, students and admins

2014-03-02 Thread Tan Tran
Hi Greg, pgsql-advocacy, and pgsql-hackers,

I'm interested in doing my GSoC project on this idea. I'm new to indexing and 
WAL, which I haven't encountered in my classes, but it sounds interesting and 
valuable to Postgresql. So here's my draft proposal. Do you mind giving your 
opinion and corrections? With your help I'll add some technical detail to my 
plans.

Thanks,
Tan Tran

Introduction
In write-ahead logging (WAL), all modifications to a database are 
written to a write-ahead log before being flushed to disk at periodic 
checkpoints. This method saves I/O operations, enables a continuous backup, 
and, in the case of database failure, guarantees data integrity up until the 
last saved checkpoint. In Postgresql’s implementation, transactions are written 
to XLog, which is divided into 16MB files (“segments”) that together comprise a 
complete history of transactions. Transactions are continually appended to the 
latest segment, while checkpointing continually archives segments up until the 
last checkpoint. Internally, a suite of XLog structures and functions 
interfaces with the various resource managers so they can log a sufficient 
amount of data to restore data (“redo”) in case of failure.
Another Postgresql feature is the creation of indexes on a invariant 
custom field; for example, on the LastName of a Person even though the primary 
key is ID. These custom indexes speed up row lookup. Postgres currently 
supports four index types: B-tree, GiST, and GIN, and hash. Indexes on the 
former three are WAL-recoverable, but hashing is not.

2. Proposal
As a GSoC student, I will implement WAL recovery of hash indexes using 
the other index types’ WAL code as a guide. Roughly, I will:
- Devise a way to store and retrieve hashing data within the XLog data 
structures. 
- In the existing skeleton for hash_redo(XLogRecPtr lsn, XLogRecord *record) in 
hash.c, branch to code for the various redo operations: creating an index, 
inserting into an index, deleting an index, and page operations (split, delete, 
update?).
- Code each branch by drawing on examples from btree_redo, gin_redo, and 
gist_redo, the existing XLog code of the other index types.

Benefits
Hash index searching is O(1), which is asymptotically faster than the O(n lg n) 
searching of a B-tree, and does not require custom indexing functions like GIN 
and GIST inherently do. Therefore it is desirable for rows that will only be 
retrieved on an equality or inequality relation. However, two things currently 
stand in the way of its popular use. From the Postgresql documentation,
“Hash index operations are not presently WAL-logged, so hash indexes 
might need to be rebuilt with REINDEX after a database crash if there were 
unwritten changes. Also, changes to hash indexes are not replicated over 
streaming or file-based replication after the initial base backup, so they give 
wrong answers to queries that subsequently use them. For these reasons, hash 
index use is presently discouraged.”
My project would solve the first problem, after which I would like to stay on 
and fix the second.

To be written: Quantifiable Results, Schedule, Completeness Criteria, Bio


On Feb 28, 2014, at 6:21 AM, Greg Stark st...@mit.edu wrote:

 On Tue, Jan 28, 2014 at 5:34 PM, Thom Brown t...@linux.com wrote:
 Who would be up for mentoring this year?  And are there any project
 ideas folk would like to suggest?
 
 I mentored in the past and felt I didn't do a very good job because I
 didn't really understand the project the student was working on.
 
 There's precisely one project that I feel I would be competent to
 mentor at this point. Making hash indexes WAL recoverable. This is
 something that's easy to define the scope of and easy to determine if
 the student is on track and easy to measure when finished. It's
 something where as far as I can tell all the mentor work will be
 purely technical advice.
 
 Also it's something the project really really needs and is perfectly
 sized for a GSOC project IMHO. Also it's a great project for a student
 who might be interested in working on Postgres in the future since it
 requires learning all our idiosyncratic build and source conventions
 but doesn't require huge or controversial architectural changes.
 
 I fear a number of items in the Wiki seem unrealistically large
 projects for GSOC IMNSHO.
 
 -- 
 greg
 
 
 -- 
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers



Re: [HACKERS] [pgsql-advocacy] GSoC 2014 - mentors, students and admins

2014-03-02 Thread Tan Tran
Earlier I posted an email to this thread that I realize hijacked the 
discussion. Please continue replying to here instead.
 
On Feb 28, 2014, at 6:59 AM, Karol Trzcionka karl...@gmail.com wrote:

 W dniu 27.02.2014 22:25, Thom Brown pisze:
 On 27 February 2014 21:08, David Fetter da...@fetter.org wrote:
 For MADlib, no.  Are you asking for mentors in general?
 
 Ah yes, I should clarify.  Yes, mentors in general.
 In general I can help but I'm not sure if I'm not too fresh in pgsql ;) 
 However after GSOC as student I can try the another side.
 Regards,
 Karol



[HACKERS] GSoC on WAL-logging hash indexes

2014-03-02 Thread Tan Tran
Hi all,

Earlier I posted this in the wrong thread. Please excuse the double posting.

Tan Tran

Begin forwarded message:

 From: Tan Tran tankimt...@gmail.com
 Subject: Re: [HACKERS] GSoC 2014 - mentors, students and admins
 Date: March 2, 2014 at 5:03:14 AM PST
 To: Greg Stark st...@mit.edu
 Cc: pgsql-advocacy pgsql-advoc...@postgresql.org, PostgreSQL-development 
 pgsql-hackers@postgresql.org
 
 Hi Greg, pgsql-advocacy, and pgsql-hackers,
 
 I'm interested in doing my GSoC project on this idea. I'm new to indexing and 
 WAL, which I haven't encountered in my classes, but it sounds interesting and 
 valuable to Postgresql. So here's my draft proposal. Do you mind giving your 
 opinion and corrections? With your help I'll add some technical detail to my 
 plans.
 
 Thanks,
 Tan Tran
 
 Introduction
   In write-ahead logging (WAL), all modifications to a database are 
 written to a write-ahead log before being flushed to disk at periodic 
 checkpoints. This method saves I/O operations, enables a continuous backup, 
 and, in the case of database failure, guarantees data integrity up until the 
 last saved checkpoint. In Postgresql’s implementation, transactions are 
 written to XLog, which is divided into 16MB files (“segments”) that together 
 comprise a complete history of transactions. Transactions are continually 
 appended to the latest segment, while checkpointing continually archives 
 segments up until the last checkpoint. Internally, a suite of XLog structures 
 and functions interfaces with the various resource managers so they can log a 
 sufficient amount of data to restore data (“redo”) in case of failure.
   Another Postgresql feature is the creation of indexes on a invariant 
 custom field; for example, on the LastName of a Person even though the 
 primary key is ID. These custom indexes speed up row lookup. Postgres 
 currently supports four index types: B-tree, GiST, and GIN, and hash. Indexes 
 on the former three are WAL-recoverable, but hashing is not.
 
 2. Proposal
   As a GSoC student, I will implement WAL recovery of hash indexes using 
 the other index types’ WAL code as a guide. Roughly, I will:
 - Devise a way to store and retrieve hashing data within the XLog data 
 structures. 
 - In the existing skeleton for hash_redo(XLogRecPtr lsn, XLogRecord *record) 
 in hash.c, branch to code for the various redo operations: creating an index, 
 inserting into an index, deleting an index, and page operations (split, 
 delete, update?).
 - Code each branch by drawing on examples from btree_redo, gin_redo, and 
 gist_redo, the existing XLog code of the other index types.
 
 Benefits
 Hash index searching is O(1), which is asymptotically faster than the O(n lg 
 n) searching of a B-tree, and does not require custom indexing functions like 
 GIN and GIST inherently do. Therefore it is desirable for rows that will only 
 be retrieved on an equality or inequality relation. However, two things 
 currently stand in the way of its popular use. From the Postgresql 
 documentation,
   “Hash index operations are not presently WAL-logged, so hash indexes 
 might need to be rebuilt with REINDEX after a database crash if there were 
 unwritten changes. Also, changes to hash indexes are not replicated over 
 streaming or file-based replication after the initial base backup, so they 
 give wrong answers to queries that subsequently use them. For these reasons, 
 hash index use is presently discouraged.”
 My project would solve the first problem, after which I would like to stay on 
 and fix the second.
 
 To be written: Quantifiable Results, Schedule, Completeness Criteria, Bio
 
 
 On Feb 28, 2014, at 6:21 AM, Greg Stark st...@mit.edu wrote:
 
 On Tue, Jan 28, 2014 at 5:34 PM, Thom Brown t...@linux.com wrote:
 Who would be up for mentoring this year?  And are there any project
 ideas folk would like to suggest?
 
 I mentored in the past and felt I didn't do a very good job because I
 didn't really understand the project the student was working on.
 
 There's precisely one project that I feel I would be competent to
 mentor at this point. Making hash indexes WAL recoverable. This is
 something that's easy to define the scope of and easy to determine if
 the student is on track and easy to measure when finished. It's
 something where as far as I can tell all the mentor work will be
 purely technical advice.
 
 Also it's something the project really really needs and is perfectly
 sized for a GSOC project IMHO. Also it's a great project for a student
 who might be interested in working on Postgres in the future since it
 requires learning all our idiosyncratic build and source conventions
 but doesn't require huge or controversial architectural changes.
 
 I fear a number of items in the Wiki seem unrealistically large
 projects for GSOC IMNSHO.
 
 -- 
 greg
 
 
 -- 
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http