[ 
https://issues.apache.org/jira/browse/MAILBOX-170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13220006#comment-13220006
 ] 

Ioan Eugen Stan commented on MAILBOX-170:
-----------------------------------------

Hello Eric, long post ahead :)

First, could you please explain in more detail what you meant by querying the 
mailbox efficiently? I don't follow. 

Second, I don't believe a pure HBase implementation is the best option. Let me 
explain why: HBase can't handle large emails, and storing them inside HBase 
will lead to performance issues (I have some experience with this from working 
for my current employer). That's why I'm planning to move the message 
implementation to HDFS. 

Basically I wish to create an mbox on steroids: a replicated mbox that can 
provide indexed access to messages. I plan to store mailboxes as SequenceFiles 
and store in HBase the offset of the key-value pair that holds each message. 
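
A rough sketch of the offset-index idea, in Python rather than against the 
Hadoop API. Everything here is a stand-in: the in-memory buffer plays the role 
of the mailbox file in HDFS, the dict plays the role of the HBase table, and 
the record layout (8-byte UID, 4-byte length, body) is invented for 
illustration and is not the real SequenceFile format.

```python
import io
import struct

# Hypothetical record layout: 8-byte UID, 4-byte body length, then the body.
HEADER = struct.Struct(">QI")

def append_message(mailbox, index, uid, body):
    """Append one record at the end of the file and remember where it starts."""
    mailbox.seek(0, io.SEEK_END)
    offset = mailbox.tell()
    mailbox.write(HEADER.pack(uid, len(body)))
    mailbox.write(body)
    index[uid] = offset  # in the proposal this mapping would live in HBase
    return offset

def read_message(mailbox, index, uid):
    """Look up the offset, seek to it, and read exactly one record."""
    mailbox.seek(index[uid])
    stored_uid, length = HEADER.unpack(mailbox.read(HEADER.size))
    if stored_uid != uid:
        raise ValueError("index points at the wrong record")
    return mailbox.read(length)

mailbox = io.BytesIO()   # stand-in for one mailbox file
index = {}               # stand-in for the UID -> offset column
append_message(mailbox, index, 1, b"first message")
append_message(mailbox, index, 2, b"second message")
print(read_message(mailbox, index, 2))
```

The point of the sketch is only that a read costs one index lookup plus one 
seek, no matter how many messages the mailbox file contains.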

Message additions will be appends, and we will use ZK (ZooKeeper) locking to 
synchronize write access between multiple instances of James. Deletes will be 
instant markers plus MR (MapReduce) jobs that do the permanent clean-up: create 
a copy of the old file with just the messages that are not deleted, then update 
the references in HBase. Reads will be done by opening the file, seeking to the 
stored offset, and retrieving the message. I plan to mimic the Hadoop MapFile 
in HBase. I don't wish to use MapFile directly because it uses two files 
instead of one (each file uses 150 bytes of RAM plus one block, so it is not 
good with millions of mailboxes, especially when we already have HBase). All 
the metadata will be stored in HBase like it is now, for fast access; the same 
will (maybe) apply to message headers.
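
The delete-then-clean-up cycle described above could look roughly like this. 
It is a single-process Python sketch, not a MapReduce job: the buffers stand in 
for HDFS files, the returned dict stands in for the offsets kept in HBase, and 
the record layout and function names are made up for illustration.

```python
import io
import struct

HEADER = struct.Struct(">QI")  # hypothetical layout: UID, body length

def write_record(out, uid, body):
    """Write one record at the current position and return its offset."""
    offset = out.tell()
    out.write(HEADER.pack(uid, len(body)))
    out.write(body)
    return offset

def iter_records(src):
    """Yield (uid, body) for every record, scanning the file from the start."""
    src.seek(0)
    while True:
        header = src.read(HEADER.size)
        if len(header) < HEADER.size:
            return
        uid, length = HEADER.unpack(header)
        yield uid, src.read(length)

def compact(old_file, deleted_uids):
    """Copy the live records into a new file and rebuild the offset index."""
    new_file = io.BytesIO()
    new_index = {}
    for uid, body in iter_records(old_file):
        if uid not in deleted_uids:
            new_index[uid] = write_record(new_file, uid, body)
    return new_file, new_index

old = io.BytesIO()
for uid, body in [(1, b"keep"), (2, b"drop"), (3, b"keep too")]:
    write_record(old, uid, body)

deleted = {2}  # the instant delete only records a marker
new_file, new_index = compact(old, deleted)  # the later clean-up pass
```

The old file is never modified; readers can keep using it until the new 
offsets have been published.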

Messages will be stored with the UID as key (UIDs are ascending), which means 
we can also iterate over them for bulk loads. 
Also, because a file is stored in HDFS and replicated, we can get good 
performance since readers can access it from many nodes. I have to look at the 
message access pattern to optimize this. Replication is set per file, so we 
can replicate frequently accessed mailboxes more times than usual. That gives 
good read performance because we can read in parallel, and they are immutable ;). 

I plan to implement a special type of Writable that will allow us to stream the 
message from HBase and avoid loading the whole message into memory. 
BytesWritable is fine for a start, but it uses readFully to load the whole 
value of a SequenceFile entry (i.e. our message), so big messages will cause 
problems. 
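
To illustrate the difference between loading a value fully and streaming it, 
here is a small Python sketch. It does not use the Writable interface; the 
buffer stands in for the stored file, and the record layout, function name, and 
chunk size are assumptions chosen for the example.

```python
import io
import struct

HEADER = struct.Struct(">QI")  # hypothetical layout: UID, body length

def stream_message(mailbox, offset, chunk_size=8192):
    """Yield the body stored at `offset` in pieces of at most chunk_size bytes."""
    mailbox.seek(offset)
    _uid, remaining = HEADER.unpack(mailbox.read(HEADER.size))
    while remaining > 0:
        chunk = mailbox.read(min(chunk_size, remaining))
        if not chunk:
            raise EOFError("record is truncated")
        remaining -= len(chunk)
        yield chunk

body = b"x" * 20000
mailbox = io.BytesIO(HEADER.pack(7, len(body)) + body)

# Only one chunk is held at a time while iterating; the list is built here
# just to inspect the chunk sizes.
chunk_sizes = [len(chunk) for chunk in stream_message(mailbox, 0)]
print(chunk_sizes)
```

Memory use is bounded by the chunk size instead of the message size, which is 
the property the custom Writable would need to provide.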

I plan to use the Hadoop FileSystem class, so we will use the same distributed 
filesystem that HBase uses. This means the implementation could run on any 
distributed filesystem supported by HBase. 

I also think HBase is intimately tied to Hadoop and that this will not change 
in the near future, so not taking advantage of it would be kind of a dumb thing 
to do. 

Basically that's all. With enough free time I think we can make James run in a 
cluster. 

Cheers, 


                
> Store mailboxes in HDFS SequenceFile
> ------------------------------------
>
>                 Key: MAILBOX-170
>                 URL: https://issues.apache.org/jira/browse/MAILBOX-170
>             Project: James Mailbox
>          Issue Type: Improvement
>          Components: hbase
>    Affects Versions: 0.4
>            Reporter: Ioan Eugen Stan
>            Assignee: Ioan Eugen Stan
>             Fix For: 0.5
>
>
> The current implementation stores messages directly in HBase. I believe a 
> better approach is to store the messages as SequenceFiles in the <mail_ID>: 
> <message_data>. HBase will store offsets into the SequenceFile for each 
> mailbox, for fast access, similar to a Hadoop MapFile.  


        
