[ https://issues.apache.org/jira/browse/MAILBOX-44?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017441#comment-13017441 ]
Robert Burrell Donkin commented on MAILBOX-44:
----------------------------------------------

A distributed email server is an interesting topic :-)

There are a number of different ways one might reasonably approach the problem.

Take a look at the way UIDs are defined in IMAP [1]. The strong uniqueness properties may only be required within a mailbox, not universally. Though mailboxes can be shared, the requirement to maintain message sequence numbers [2] limits how well concurrent access to a single mailbox will scale.

This suggests to me that the framers of the IMAP standard considered the possibility that distribution might happen between the protocol and mailbox tiers. In that scenario, the servers handling client connections and those handling mailboxes would operate in separate processes, potentially separated by a network. Each mailbox could then be located close to dedicated storage.

I believe that a consequence of this engineering decision by the standards group may be that a fully distributed UID is not really necessary. I suspect that using HBase [3] or Cassandra [4] to store UIDVALIDITY+UID keyed by mailbox name (perhaps using Gora [5]) would be good enough.

[1] http://tools.ietf.org/html/rfc3501

2.3.1.1. Unique Identifier (UID) Message Attribute

A 32-bit value assigned to each message, which when used with the unique identifier validity value (see below) forms a 64-bit value that MUST NOT refer to any other message in the mailbox or any subsequent mailbox with the same name forever. Unique identifiers are assigned in a strictly ascending fashion in the mailbox; as each message is added to the mailbox it is assigned a higher UID than the message(s) which were added previously. Unlike message sequence numbers, unique identifiers are not necessarily contiguous.

The unique identifier of a message MUST NOT change during the session, and SHOULD NOT change between sessions.
Any change of unique identifiers between sessions MUST be detectable using the UIDVALIDITY mechanism discussed below. Persistent unique identifiers are required for a client to resynchronize its state from a previous session with the server (e.g., disconnected or offline access clients); this is discussed further in [IMAP-DISC].

Associated with every mailbox are two values which aid in unique identifier handling: the next unique identifier value and the unique identifier validity value.

The next unique identifier value is the predicted value that will be assigned to a new message in the mailbox. Unless the unique identifier validity also changes (see below), the next unique identifier value MUST have the following two characteristics. First, the next unique identifier value MUST NOT change unless new messages are added to the mailbox; and second, the next unique identifier value MUST change whenever new messages are added to the mailbox, even if those new messages are subsequently expunged.

Note: The next unique identifier value is intended to provide a means for a client to determine whether any messages have been delivered to the mailbox since the previous time it checked this value. It is not intended to provide any guarantee that any message will have this unique identifier. A client can only assume, at the time that it obtains the next unique identifier value, that messages arriving after that time will have a UID greater than or equal to that value.

The unique identifier validity value is sent in a UIDVALIDITY response code in an OK untagged response at mailbox selection time. If unique identifiers from an earlier session fail to persist in this session, the unique identifier validity value MUST be greater than the one used in the earlier session.

Note: Ideally, unique identifiers SHOULD persist at all times.
Although this specification recognizes that failure to persist can be unavoidable in certain server environments, it STRONGLY ENCOURAGES message store implementation techniques that avoid this problem. For example:

1) Unique identifiers MUST be strictly ascending in the mailbox at all times. If the physical message store is re-ordered by a non-IMAP agent, this requires that the unique identifiers in the mailbox be regenerated, since the former unique identifiers are no longer strictly ascending as a result of the re-ordering.

2) If the message store has no mechanism to store unique identifiers, it must regenerate unique identifiers at each session, and each session must have a unique UIDVALIDITY value.

3) If the mailbox is deleted and a new mailbox with the same name is created at a later date, the server must either keep track of unique identifiers from the previous instance of the mailbox, or it must assign a new UIDVALIDITY value to the new instance of the mailbox. A good UIDVALIDITY value to use in this case is a 32-bit representation of the creation date/time of the mailbox. It is alright to use a constant such as 1, but only if it guaranteed that unique identifiers will never be reused, even in the case of a mailbox being deleted (or renamed) and a new mailbox by the same name created at some future time.

4) The combination of mailbox name, UIDVALIDITY, and UID must refer to a single immutable message on that server forever. In particular, the internal date, [RFC-2822] size, envelope, body structure, and message texts (RFC822, RFC822.HEADER, RFC822.TEXT, and all BODY[...] fetch data items) must never change. This does not include message numbers, nor does it include attributes that can be set by a STORE command (e.g., FLAGS).

[2] http://tools.ietf.org/html/rfc3501

2.3.1.2. Message Sequence Number Message Attribute

A relative position from 1 to the number of messages in the mailbox. This position MUST be ordered by ascending unique identifier.
As each new message is added, it is assigned a message sequence number that is 1 higher than the number of messages in the mailbox before that new message was added.

Message sequence numbers can be reassigned during the session. For example, when a message is permanently removed (expunged) from the mailbox, the message sequence number for all subsequent messages is decremented. The number of messages in the mailbox is also decremented. Similarly, a new message can be assigned a message sequence number that was once held by some other message prior to an expunge.

In addition to accessing messages by relative position in the mailbox, message sequence numbers can be used in mathematical calculations. For example, if an untagged "11 EXISTS" is received, and previously an untagged "8 EXISTS" was received, three new messages have arrived with message sequence numbers of 9, 10, and 11. Another example, if message 287 in a 523 message mailbox has UID 12345, there are exactly 286 messages which have lesser UIDs and 236 messages which have greater UIDs.

[3] http://hbase.apache.org/
[4] http://cassandra.apache.org/
[5] http://incubator.apache.org/gora/

> [gsoc2011] Design and implement a distributed mailbox using Hadoop
> ------------------------------------------------------------------
>
>                 Key: MAILBOX-44
>                 URL: https://issues.apache.org/jira/browse/MAILBOX-44
>             Project: James Mailbox
>          Issue Type: New Feature
>            Reporter: Eric Charles
>            Assignee: Norman Maurer
>              Labels: gsoc2011
>
> Context: The mailbox subproject (http://james.apache.org/mailbox/) supports
> maildir, SQL database (via JPA) and Java Content Repository (JCR) as
> technology for mail storage. This flexibility is achieved thanks to an API
> design that abstracts mail storage from the mail protocols.
> Task: We need to implement mailbox storage as a distributed system on top of
> Hadoop HDFS. The James mailbox API will be used. A first step is to design
> how to interact with Hadoop (native api, gora incubator at apache,...) and
> deal with specific performance questions related to mail loading/parsing in a
> distributed system (use map/reduce or not, use existing local lucene indexes
> for search,...). The second step is to implement the HDFS mailbox (maildir
> mailbox is similar because it stores mails as a file and can be an
> inspiration). A single James server will still be deployed because we don't
> have any distributed UID generation.
> Mentor: eric at apache dot org
> Complexity: medium
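
To make the per-mailbox idea in the comment above a little more concrete, here is a minimal sketch of UIDVALIDITY + next UID state keyed by mailbox name. Everything in it is an assumption for illustration: the class and method names are hypothetical and are not part of the James mailbox API, and an in-memory ConcurrentHashMap stands in for the HBase or Cassandra table. A real store would need an equivalent atomic conditional update for the allocation step. The UIDVALIDITY value follows the RFC 3501 suggestion of a 32-bit representation of the mailbox creation time.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical sketch: per-mailbox UID state keyed by mailbox name. */
final class MailboxUidAllocator {

    /** Immutable snapshot of the UID state of one mailbox. */
    static final class UidState {
        final long uidValidity; // 32-bit value, held in a long
        final long nextUid;     // UID that the next message will receive

        UidState(long uidValidity, long nextUid) {
            this.uidValidity = uidValidity;
            this.nextUid = nextUid;
        }
    }

    // Assumption: stand-in for a distributed key-value store.
    private final ConcurrentMap<String, UidState> store = new ConcurrentHashMap<>();

    /** Registers a (new) mailbox; UIDVALIDITY is derived from its creation time in seconds. */
    long create(String mailboxName, long creationEpochSeconds) {
        long validity = creationEpochSeconds & 0xFFFFFFFFL; // keep the low 32 bits
        store.put(mailboxName, new UidState(validity, 1L));
        return validity;
    }

    /** Returns the current UIDVALIDITY of the mailbox. */
    long uidValidity(String mailboxName) {
        return require(mailboxName).uidValidity;
    }

    /** Allocates the next UID with a compare-and-swap loop, so UIDs are strictly ascending. */
    long allocateUid(String mailboxName) {
        while (true) {
            UidState current = require(mailboxName);
            UidState updated = new UidState(current.uidValidity, current.nextUid + 1);
            // replace() succeeds only if no other writer changed the entry in between.
            if (store.replace(mailboxName, current, updated)) {
                return current.nextUid;
            }
        }
    }

    private UidState require(String mailboxName) {
        UidState state = store.get(mailboxName);
        if (state == null) {
            throw new IllegalStateException("unknown mailbox: " + mailboxName);
        }
        return state;
    }
}
```

Because the state is scoped to a single mailbox name, contention is limited to writers delivering into the same mailbox, and no cluster-wide UID generator is involved. Recreating a mailbox under the same name at a later time yields a different UIDVALIDITY, so UIDs may restart at 1. The sketch does not handle two creations within the same second.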