Hi Greg, pgsql-advocacy, and pgsql-hackers,
I'm interested in doing my GSoC project on this idea. I'm new to indexing and
WAL, which I haven't encountered in my classes, but it sounds interesting and
valuable to Postgresql. So here's my draft proposal. Do you mind giving your
opinion and corrections? With your help I'll add some technical detail to my
plans.
Thanks,
Tan Tran
Introduction
In write-ahead logging (WAL), all modifications to a database are
written to a write-ahead log before being flushed to disk at periodic
checkpoints. This method saves I/O operations, enables a continuous backup,
and, in the case of database failure, guarantees data integrity up until the
last saved checkpoint. In Postgresql’s implementation, transactions are written
to XLog, which is divided into 16MB files (“segments”) that together comprise a
complete history of transactions. Transactions are continually appended to the
latest segment, while checkpointing continually archives segments up until the
last checkpoint. Internally, a suite of XLog structures and functions
interfaces with the various resource managers so they can log a sufficient
amount of data to restore data (“redo”) in case of failure.
Another Postgresql feature is the creation of indexes on a invariant
custom field; for example, on the LastName of a Person even though the primary
key is ID. These custom indexes speed up row lookup. Postgres currently
supports four index types: B-tree, GiST, and GIN, and hash. Indexes on the
former three are WAL-recoverable, but hashing is not.
2. Proposal
As a GSoC student, I will implement WAL recovery of hash indexes using
the other index types’ WAL code as a guide. Roughly, I will:
- Devise a way to store and retrieve hashing data within the XLog data
structures.
- In the existing skeleton for hash_redo(XLogRecPtr lsn, XLogRecord *record) in
hash.c, branch to code for the various redo operations: creating an index,
inserting into an index, deleting an index, and page operations (split, delete,
update?).
- Code each branch by drawing on examples from btree_redo, gin_redo, and
gist_redo, the existing XLog code of the other index types.
Benefits
Hash index searching is O(1), which is asymptotically faster than the O(n lg n)
searching of a B-tree, and does not require custom indexing functions like GIN
and GIST inherently do. Therefore it is desirable for rows that will only be
retrieved on an equality or inequality relation. However, two things currently
stand in the way of its popular use. From the Postgresql documentation,
“Hash index operations are not presently WAL-logged, so hash indexes
might need to be rebuilt with REINDEX after a database crash if there were
unwritten changes. Also, changes to hash indexes are not replicated over
streaming or file-based replication after the initial base backup, so they give
wrong answers to queries that subsequently use them. For these reasons, hash
index use is presently discouraged.”
My project would solve the first problem, after which I would like to stay on
and fix the second.
To be written: Quantifiable Results, Schedule, Completeness Criteria, Bio
On Feb 28, 2014, at 6:21 AM, Greg Stark st...@mit.edu wrote:
On Tue, Jan 28, 2014 at 5:34 PM, Thom Brown t...@linux.com wrote:
Who would be up for mentoring this year? And are there any project
ideas folk would like to suggest?
I mentored in the past and felt I didn't do a very good job because I
didn't really understand the project the student was working on.
There's precisely one project that I feel I would be competent to
mentor at this point. Making hash indexes WAL recoverable. This is
something that's easy to define the scope of and easy to determine if
the student is on track and easy to measure when finished. It's
something where as far as I can tell all the mentor work will be
purely technical advice.
Also it's something the project really really needs and is perfectly
sized for a GSOC project IMHO. Also it's a great project for a student
who might be interested in working on Postgres in the future since it
requires learning all our idiosyncratic build and source conventions
but doesn't require huge or controversial architectural changes.
I fear a number of items in the Wiki seem unrealistically large
projects for GSOC IMNSHO.
--
greg
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers