[
https://issues.apache.org/jira/browse/MAILBOX-170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13220006#comment-13220006
]
Ioan Eugen Stan commented on MAILBOX-170:
-----------------------------------------
Hello Eric, long post ahead :)
First, could you please explain in more detail what you meant by efficiently
querying the mailbox? I don't follow.
Second, I don't believe a pure HBase implementation is the best. Let me explain
why: HBase can't handle large emails and storing them inside HBase will lead to
performance issues (I have some experience with this from working for my
current employer). That's why I'm planning to move the message implementation
to HDFS.
Basically I wish to create an mbox on steroids -> a replicated mbox that can
provide indexed access to messages. I plan to store mailboxes as SequenceFiles
and store in HBase the offset of the key-value pair that holds each message.
Message additions will be appends and we will use ZK locking to sync write
access between multiple instances of James. Deletes will be instant markers +
MR jobs that do permanent clean-up: create a copy of the old file with just the
messages that are not deleted + update the references in HBase. Reads will be
done by opening the file, seeking to the offset and retrieving the message. I
plan to mimic the Hadoop MapFile in HBase. I don't wish to use the MapFile
directly because it uses two files instead of one (each file uses 150 bytes of
RAM + one block, so not good with millions of mailboxes, especially when we
have HBase). All the metadata will be stored in HBase like it is now, for fast
access; the same will (maybe) be done for message headers.
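To make the layout above concrete, here is a minimal in-memory sketch in
Python. It is not James, HDFS or HBase code: the class MailboxLog, the record
layout (8-byte UID, 4-byte length, payload) and all method names are
assumptions made up for illustration. A BytesIO buffer stands in for the
SequenceFile, a dict stands in for the HBase offset index, and compact() plays
the role of the clean-up MR job. ZK locking is left out.

```python
import io
import struct

# Hypothetical record header: 8-byte UID followed by 4-byte payload length.
HEADER = struct.Struct(">QI")


class MailboxLog:
    """Append-only record file plus a UID -> offset index (illustration only)."""

    def __init__(self):
        self.data = io.BytesIO()  # stands in for the mailbox file in HDFS
        self.index = {}           # stands in for the HBase table: uid -> offset
        self.deleted = set()      # delete markers
        self.next_uid = 1         # UIDs are ascending

    def append(self, message: bytes) -> int:
        """Append a message at the end of the file and index its offset."""
        uid = self.next_uid
        self.next_uid += 1
        offset = self.data.seek(0, io.SEEK_END)
        self.data.write(HEADER.pack(uid, len(message)))
        self.data.write(message)
        self.index[uid] = offset
        return uid

    def read(self, uid: int) -> bytes:
        """Look up the offset, seek to it and return the message."""
        if uid in self.deleted:
            raise KeyError(uid)
        self.data.seek(self.index[uid])
        stored_uid, length = HEADER.unpack(self.data.read(HEADER.size))
        assert stored_uid == uid
        return self.data.read(length)

    def delete(self, uid: int) -> None:
        """Instant delete: only a marker is set, the file is untouched."""
        if uid not in self.index:
            raise KeyError(uid)
        self.deleted.add(uid)

    def compact(self) -> None:
        """Copy the surviving messages to a new file and update the index."""
        new_data = io.BytesIO()
        new_index = {}
        for uid in sorted(self.index):
            if uid in self.deleted:
                continue
            message = self.read(uid)
            new_index[uid] = new_data.tell()
            new_data.write(HEADER.pack(uid, len(message)))
            new_data.write(message)
        self.data, self.index = new_data, new_index
        self.deleted.clear()
```

In this sketch a delete costs one marker write, and the file only shrinks when
compact() runs, which is the trade-off described above.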
Messages will be stored with UID as key (they are ascending) and this means we
can also iterate over them for bulk loads.
Also, because a file is stored in HDFS and replicated, we can have good
performance since readers can access it from many nodes. I have to see the
message access patterns to optimize this. Replication is done per file so we
can replicate frequently accessed mailboxes more times than usual => good
performance on reads because we can read in parallel => they are immutable ;).
I plan to implement a special type of Writable that will allow us to stream the
message from HBase and avoid loading the whole message into memory.
BytesWritable is fine for a start, but it uses readFully to load the whole
value of a SequenceFile record (== our message), so big messages will cause
problems.
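The difference between reading a value fully and streaming it can be sketched
as follows. This is plain Python, not the Hadoop Writable API: the helper names
write_record and stream_record and the 4-byte length prefix are assumptions
for illustration only. The point is that the reader never holds more than
chunk_size bytes of the message at once.

```python
import io
import struct


def write_record(out, payload: bytes) -> None:
    """Write one length-prefixed record (hypothetical layout: 4-byte length)."""
    out.write(struct.pack(">I", len(payload)))
    out.write(payload)


def stream_record(src, chunk_size: int = 4096):
    """Yield the record at the current position in chunks of bounded size."""
    (length,) = struct.unpack(">I", src.read(4))
    remaining = length
    while remaining > 0:
        chunk = src.read(min(chunk_size, remaining))
        if not chunk:
            raise EOFError("record truncated")
        remaining -= len(chunk)
        yield chunk


# Usage: copy a large message to a sink without materializing it in memory.
buf = io.BytesIO()
write_record(buf, b"x" * 10000)
write_record(buf, b"next message")
buf.seek(0)

sink = io.BytesIO()
chunk_sizes = []
for chunk in stream_record(buf):
    chunk_sizes.append(len(chunk))
    sink.write(chunk)
```

Once the generator is exhausted the source is positioned at the start of the
next record, so records can still be read one after another.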
I plan to use the Hadoop FileSystem class so we will use the same distributed
filesystem HBase uses => this means the implementation could run on any
distributed fs supported by HBase.
I also think HBase is intimately tied with Hadoop and things will not change in
the near future so not taking advantage of that is kind of a dumb thing to do.
Basically that's all. With enough free time I think we can make James run in a
cluster.
Cheers,
> Store mailboxes in HDFS SequenceFile
> ------------------------------------
>
> Key: MAILBOX-170
> URL: https://issues.apache.org/jira/browse/MAILBOX-170
> Project: James Mailbox
> Issue Type: Improvement
> Components: hbase
> Affects Versions: 0.4
> Reporter: Ioan Eugen Stan
> Assignee: Ioan Eugen Stan
> Fix For: 0.5
>
>
> The current implementation stores messages directly in HBase. I believe a
> better approach is to store the messages as SequenceFiles in the <mail_ID>:
> <message_data>. HBase will store SequenceFile offsets in the SequenceFile
> for each mailbox for fast access similar to a hadoop MapFile.