[ 
https://issues.apache.org/jira/browse/MAILBOX-173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mihai Soloi updated MAILBOX-173:
--------------------------------

    Attachment: MAILBOX-173.patch

This patch implements an inverted index in an HBase table for searching 
through the mails in a mailbox.

The structure of the index is as follows:

   1. mailboxID is a java.util.UUID.
   2. The fields are now enums; what is stored is a single byte that 
identifies the enum constant.
   3. The terms in each field are tokenized using Lucene's 
org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer; some fields are 
not tokenized because of their nature (SENT_DATE, for example).

The row key is composed of all of the above byte arrays concatenated. Because 
HBase keeps rows sorted by key, searching the table, as well as looking up a 
specific mailbox and field, becomes a scan over a contiguous key range. The 
mail ID is the qualifier in the static column family (the only column family), 
so mail IDs are found with relative ease.
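
The row layout described above can be sketched roughly as follows. This is 
only an illustration built on the JDK, not code from the patch: the class 
name, the Field enum constants and their byte ids, and the choice of UTF-8 
for the term bytes are all assumptions.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.UUID;

/** Hypothetical sketch of the row key layout; names and ids are not from the patch. */
final class IndexRowKeys {

    /** Illustrative field enum; the real constants and byte ids may differ. */
    enum Field {
        BODY((byte) 1), SUBJECT((byte) 2), SENT_DATE((byte) 3);

        final byte id;

        Field(byte id) {
            this.id = id;
        }
    }

    /** Prefix shared by all rows of one mailbox and field: 16 UUID bytes + 1 field byte. */
    static byte[] prefix(UUID mailboxId, Field field) {
        return ByteBuffer.allocate(17)
                .putLong(mailboxId.getMostSignificantBits())
                .putLong(mailboxId.getLeastSignificantBits())
                .put(field.id)
                .array();
    }

    /** Full row key: the prefix followed by the bytes of one token (UTF-8 assumed). */
    static byte[] rowKey(UUID mailboxId, Field field, String term) {
        byte[] p = prefix(mailboxId, field);
        byte[] t = term.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(p.length + t.length).put(p).put(t).array();
    }

    public static void main(String[] args) {
        UUID mailbox = UUID.fromString("123e4567-e89b-12d3-a456-426614174000");
        byte[] key = rowKey(mailbox, Field.SUBJECT, "hbase");
        // 16 bytes of UUID + 1 field byte + 5 term bytes
        System.out.println(key.length); // prints 22
    }
}
```

With a layout like this, every row for a given mailbox and field shares the 
same 17-byte prefix, so a lookup on that mailbox and field can be expressed 
as a scan over the rows starting with that prefix.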

That covers the mail document itself. The flags are stored in a single row in 
the table (one row per mailbox) and can be found easily with a scan. Each of 
the rows currently has an empty value; in the future we may be able to store 
data there related to the term frequency in the document.

What currently works are searches based on text, flags, headers, all 
criteria, UID and UID ranges. These are implemented using Filters inside an 
Endpoint Coprocessor, which has the benefit of transferring less data over the 
network and distributing the processing across the regions.
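
To illustrate why filtering on the region side reduces network transfer, 
here is a small in-memory simulation in plain Java. It does not use the 
HBase API and is not the code in the patch: each "region" is modelled as a 
sorted map from row key to the mail UIDs stored as qualifiers, row keys are 
plain strings rather than byte arrays, and all names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.NavigableSet;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical simulation of region-side filtering; no HBase classes involved. */
final class RegionSideSearch {

    /** Runs "on the region": returns only UIDs under rowPrefix within [minUid, maxUid]. */
    static NavigableSet<Long> searchRegion(NavigableMap<String, NavigableSet<Long>> region,
                                           String rowPrefix, long minUid, long maxUid) {
        NavigableSet<Long> hits = new TreeSet<>();
        // Rows are sorted, so all rows sharing the prefix are contiguous.
        for (Map.Entry<String, NavigableSet<Long>> row
                : region.tailMap(rowPrefix, true).entrySet()) {
            if (!row.getKey().startsWith(rowPrefix)) {
                break; // left the prefix range
            }
            hits.addAll(row.getValue().subSet(minUid, true, maxUid, true));
        }
        return hits;
    }

    /** Runs "on the client": merges the already-filtered per-region results. */
    static NavigableSet<Long> search(List<NavigableMap<String, NavigableSet<Long>>> regions,
                                     String rowPrefix, long minUid, long maxUid) {
        NavigableSet<Long> merged = new TreeSet<>();
        for (NavigableMap<String, NavigableSet<Long>> region : regions) {
            merged.addAll(searchRegion(region, rowPrefix, minUid, maxUid));
        }
        return merged;
    }

    public static void main(String[] args) {
        NavigableMap<String, NavigableSet<Long>> region1 = new TreeMap<>();
        region1.put("mb1|SUBJECT|hbase", new TreeSet<>(Arrays.asList(1L, 5L, 9L)));
        NavigableMap<String, NavigableSet<Long>> region2 = new TreeMap<>();
        region2.put("mb1|SUBJECT|lucene", new TreeSet<>(Arrays.asList(2L, 12L)));

        // Only UIDs 2..9 leave the regions; 1 and 12 are dropped before the merge.
        System.out.println(search(Arrays.asList(region1, region2), "mb1|SUBJECT|", 2, 9));
        // prints [2, 5, 9]
    }
}
```

In the simulation, only the matching UIDs cross the boundary between 
searchRegion and search, which stands in for the network hop between a 
region server and the client.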
                
> [gsoc2012] Distribuited mailbox indexing over HBase/HDFS
> --------------------------------------------------------
>
>                 Key: MAILBOX-173
>                 URL: https://issues.apache.org/jira/browse/MAILBOX-173
>             Project: James Mailbox
>          Issue Type: New Feature
>          Components: hbase, lucene, store
>            Reporter: Ioan Eugen Stan
>            Assignee: Ioan Eugen Stan
>              Labels: gsoc, gsoc2012, mentor
>         Attachments: MAILBOX-173.patch
>
>
> James provides a module called Lucene Mailbox Index that knows how to index 
> emails. Indexing is done by providing a suitable Lucene Directory 
> implementation that will store the index and allow searching. Lucene comes 
> with a file-system Directory, a JDBC Directory and a few other 
> implementations to store the index in a file system or in a database.
> In order to provide distributed search, we should write a Directory 
> implementation that will store the index in HBase. Such an implementation is 
> described very well here [1].
> [1] http://www.infoq.com/articles/LuceneHbase

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
