[jira] [Commented] (MAILBOX-44) [gsoc2011] Design and implement a distributed mailbox using Hadoop

Eric Charles (JIRA) Fri, 08 Apr 2011 05:34:46 -0700

    [ 
https://issues.apache.org/jira/browse/MAILBOX-44?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017427#comment-13017427
 ]


Eric Charles commented on MAILBOX-44:
-------------------------------------

Regarding : "Another problem to settle is the format and compression of the 
HDFS files to store the emails", one option would be Avro (other options would 
be to use one of the native HDFS file types, or to develop a MailHadoopFile).

From http://avro.apache.org/docs/current/, Avro provides:

    Rich data structures.
    A compact, fast, binary data format.
    A container file, to store persistent data.
    Remote procedure call (RPC).
    Simple integration with dynamic languages. Code generation is not required 
to read or write data files nor to use or implement RPC protocols. Code 
generation as an optional optimization, only worth implementing for statically 
typed languages.

The nice thing is that you define your format in JSON and you get the 
persistence of your objects in Hadoop for free (directly and via map/reduce).
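
As a rough illustration of "define your format in JSON", here is a minimal 
sketch of what an Avro record schema for a stored mail might look like. The 
record name, namespace and every field are assumptions made up for this 
example and are not part of the James mailbox API. The snippet only parses the 
schema with Python's standard json module; writing records to an Avro 
container file would additionally need an Avro library.

```python
import json

# Hypothetical Avro schema for one stored mail.
# All names below (Mail, org.example.mailbox, uid, ...) are illustrative
# assumptions, not an existing James or Hadoop definition.
MAIL_SCHEMA_JSON = """
{
  "type": "record",
  "name": "Mail",
  "namespace": "org.example.mailbox",
  "fields": [
    {"name": "uid", "type": "long"},
    {"name": "mailbox", "type": "string"},
    {"name": "internalDate", "type": "long"},
    {"name": "flags", "type": {"type": "array", "items": "string"}},
    {"name": "headers", "type": {"type": "map", "values": "string"}},
    {"name": "body", "type": "bytes"}
  ]
}
"""

# An Avro schema is plain JSON, so any JSON parser can load and inspect it.
mail_schema = json.loads(MAIL_SCHEMA_JSON)
field_names = [field["name"] for field in mail_schema["fields"]]
print(field_names)
```

Keeping the body as "bytes" is one possible choice that leaves the raw 
message untouched; splitting it into MIME parts would be another design to 
compare.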

For example, Twitter uses a similar mechanism to store their tweets (very small 
objects) in their distributed store.

To be tested/compared with other alternatives...

It would be cool to include this in your application. Thanks,

> [gsoc2011] Design and implement a distributed mailbox using Hadoop
> ------------------------------------------------------------------
>
>                 Key: MAILBOX-44
>                 URL: https://issues.apache.org/jira/browse/MAILBOX-44
>             Project: James Mailbox
>          Issue Type: New Feature
>            Reporter: Eric Charles
>            Assignee: Norman Maurer
>              Labels: gsoc2011
>
> Context: The mailbox subproject (http://james.apache.org/mailbox/) supports 
> maildir, SQL database (via JPA) and Java Content Repository (JCR) as 
> technology for mail storage. This flexibility is achieved thanks to an API 
> design that abstracts mail storage from the mail protocols.
> Task: We need to implement mailbox storage as a distributed system on top of 
> Hadoop HDFS. The James mailbox API will be used. A first step is to design 
> how to interact with Hadoop (native api, gora incubator at apache,...) and 
> deal with specific performance questions related to mail loading/parsing in a 
> distributed system (use map/reduce or not, use existing local lucene indexes 
> for search,...). The second step is to implement the HDFS mailbox (the maildir 
> mailbox is similar because it stores mails as files and can be an 
> inspiration). A single James server will still be deployed because we don't 
> have any distributed UID generation.
> Mentor: eric at apache dot org
> Complexity: medium 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
