[jira] [Commented] (MAILBOX-44) [gsoc2011] Design and implement a distributed mailbox using Hadoop

stack (JIRA) Tue, 14 Jun 2011 14:41:57 -0700

    [ https://issues.apache.org/jira/browse/MAILBOX-44?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13049445#comment-13049445 ]


stack commented on MAILBOX-44:
------------------------------

@Loan Going the Gora route will allow you to swap stores. I've not used it, so 
I'm not up on the costs that come with the indirection (if any). 

You'll need to figure out a schema design for your store. I'd suggest you 
study how James currently does queries and make a list of them. This will be 
the key input feeding your schema design. For example, in the forthcoming 
"HBase: The Definitive Guide", Lars has some discussion of HBase as a mail 
store. Rows are sorted in HBase, so he arrives at a row key schema that looks 
like this:

{code}
<userid><date in reversed chronological order so you see newest mail first><message-id><attachment-id>
{code}

You can start up a scan to see all mail for a user and you'll see the latest 
first. Rows will be grouped by message id. If attachment ids are their 
sequence numbers, then they'll be encountered in order (you'll probably need 
to zero-pad some of the attributes above so that they sort correctly). This is 
just an example. You may end up with a different row key design after you've 
studied the James queries.
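
To make the sorting argument concrete, here is a minimal sketch of how such a 
row key could be composed. It is not taken from the book or from James: the 
class name, the '|' separator, the field widths and the use of 
Long.MAX_VALUE minus the timestamp are all assumptions chosen for 
illustration, and a TreeSet stands in for HBase's sorted rows (for ASCII keys 
the ordering is the same).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

class RowKeySketch {
    // Hypothetical separator; assumes it never occurs in a user id or message id.
    static final char SEP = '|';

    // Builds <userid><reversed date><message-id><attachment-id> as one string.
    static String rowKey(String userId, long epochMillis, String messageId, int attachmentSeq) {
        // Subtracting from Long.MAX_VALUE makes newer mail sort first.
        // Assumes epochMillis >= 0 so the result always fits in 19 digits.
        long reversed = Long.MAX_VALUE - epochMillis;
        return userId + SEP
            + String.format("%019d", reversed) + SEP   // zero-padded to a fixed width
            + messageId + SEP
            + String.format("%04d", attachmentSeq);    // zero-padded sequence number
    }

    // Emulates a scan over one user's rows: all keys starting with "<userid>|".
    static List<String> scanUser(TreeSet<String> rows, String userId) {
        String start = userId + SEP;
        String stop = userId + (char) (SEP + 1);       // first key past the prefix
        return new ArrayList<>(rows.subSet(start, true, stop, false));
    }

    public static void main(String[] args) {
        TreeSet<String> rows = new TreeSet<>();        // stand-in for sorted HBase rows
        rows.add(rowKey("alice", 1000L, "msg-a", 0));
        rows.add(rowKey("alice", 2000L, "msg-b", 1));
        rows.add(rowKey("alice", 2000L, "msg-b", 0));
        rows.add(rowKey("bob", 3000L, "msg-c", 0));
        for (String key : scanUser(rows, "alice")) {
            System.out.println(key);
        }
    }
}
```

Scanning for "alice" here returns the two msg-b rows (attachment 0, then 1) 
before the older msg-a row, and skips bob's rows. Without the zero padding, 
attachment 10 would sort before attachment 2.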



> [gsoc2011] Design and implement a distributed mailbox using Hadoop
> ------------------------------------------------------------------
>
>                 Key: MAILBOX-44
>                 URL: https://issues.apache.org/jira/browse/MAILBOX-44
>             Project: James Mailbox
>          Issue Type: New Feature
>            Reporter: Eric Charles
>            Assignee: Norman Maurer
>              Labels: gsoc2011
>             Fix For: 0.3
>
>
> Context: The mailbox subproject (http://james.apache.org/mailbox/) supports 
> maildir, SQL database (via JPA) and Java Content Repository (JCR) as 
> technology for mail storage. This flexibility is achieved thanks to an API 
> design that abstracts mail storage from the mail protocols.
> Task: We need to implement mailbox storage as a distributed system on top of 
> Hadoop HDFS. The James mailbox API will be used. A first step is to design 
> how to interact with Hadoop (native api, gora incubator at apache,...) and 
> deal with specific performance questions related to mail loading/parsing in a 
> distributed system (use map/reduce or not, use existing local lucene indexes 
> for search,...). The second step is to implement the HDFS mailbox (maildir 
> mailbox is similar because it stores mails as files and can be an 
> inspiration). A single James server will still be deployed because we don't 
> have any distributed UID generation.
> Mentor: eric at apache dot org
> Complexity: medium 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
