[
https://issues.apache.org/jira/browse/SOLR-934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Shalin Shekhar Mangar updated SOLR-934:
---------------------------------------
Attachment: SOLR-934.patch
Changes
# Added messageId as another field
# Added another core to example-DIH for indexing mails. When the example target
is run, it copies the tika libs, mail.jar, activation.jar and extras.jar
into the example/example-DIH/solr/mail/lib directory.
# Added a maven pom template for extras jar
# Updated maven related targets in the main build.xml for the new pom
# Added licenses for mail.jar and activation.jar in LICENSE.txt
I'm not sure what needs to be added to NOTICE.txt. Can anybody help?
To run this:
# Apply this patch
# Create a directory called lib inside contrib/dataimporthandler
# Download and add mail.jar and activation.jar in the above directory
# Update example/example-DIH/solr/mail/conf/data-config.xml with your mail
server and login details
# Run ant clean example
# cd example
# java -Dsolr.solr.home=./example-DIH/solr -jar start.jar
# Hit http://localhost:8983/solr/mail/dataimport?command=full-import
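If you would rather trigger the import from code than from a browser, something like the following should work. This is only a sketch: the class and method names are made up for illustration, and the base URL and core name are simply the ones from the steps above.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper, not part of the patch.
final class DataImportUrl {

    // Builds the full-import URL for a core, e.g. the "mail" core from the steps above.
    static URI fullImport(String solrBaseUrl, String core) {
        return URI.create(solrBaseUrl + "/" + core + "/dataimport?command=full-import");
    }

    // Sends a GET request to the full-import URL and returns the HTTP status code.
    static int trigger(String solrBaseUrl, String core) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) fullImport(solrBaseUrl, core).toURL().openConnection();
        try {
            conn.setRequestMethod("GET");
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumes the example server from the steps above is running locally.
        System.out.println(trigger("http://localhost:8983/solr", "mail"));
    }
}
```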
I'll let people try this out before committing this in a day or two.
This will probably need some more enhancements which can be done through
additional issues. Some that I can think of are:
# Pluggable CustomFilter implementations
# Making fields/methods inside MailEntityProcessor protected so functionality
can be enhanced/overridden
# Attachments are stored in two separate fields, attachment and
attachmentNames; we need a way to associate one with the other. I recall some
discussion on the LocalSolr issue about something similar for multiple
lat/long pairs.
# Enhance example configuration to be able to run a mailing list search service
out-of-the-box
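On the attachment/attachmentNames point, one possible interim approach on the client side is to pair the two multi-valued fields by position. The sketch below is not part of the patch. It assumes that both fields keep their values in the order they were added and always have the same number of values, which I have not verified; the class name is made up.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the patch.
final class AttachmentPairs {

    // Pairs the i-th attachment name with the i-th attachment content.
    // ASSUMPTION: both lists are in insertion order and have equal length.
    static List<Map.Entry<String, String>> pair(List<String> names, List<String> contents) {
        if (names.size() != contents.size()) {
            throw new IllegalArgumentException(
                    "expected as many names as contents, got "
                            + names.size() + " and " + contents.size());
        }
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        for (int i = 0; i < names.size(); i++) {
            // A list of entries (not a Map) so that duplicate file names are kept.
            pairs.add(new AbstractMap.SimpleImmutableEntry<>(names.get(i), contents.get(i)));
        }
        return pairs;
    }
}
```

A list of entries is used instead of a map because two attachments in the same mail may share a file name.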
> Enable importing of mails into a solr index through DIH.
> --------------------------------------------------------
>
> Key: SOLR-934
> URL: https://issues.apache.org/jira/browse/SOLR-934
> Project: Solr
> Issue Type: New Feature
> Components: contrib - DataImportHandler
> Affects Versions: 1.4
> Reporter: Preetam Rao
> Assignee: Shalin Shekhar Mangar
> Fix For: 1.4
>
> Attachments: SOLR-934.patch, SOLR-934.patch, SOLR-934.patch,
> SOLR-934.patch, SOLR-934.patch
>
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> Enable importing of mails into solr through DIH. Take one or more mailbox
> credentials, download and index their content along with the content from
> attachments. The folders to fetch can be made configurable based on various
> criteria. Apache Tika is used for extracting content from different kinds of
> attachments. JavaMail is used for mail box related operations like fetching
> mails, filtering them etc.
> The basic configuration for one mail box is as below:
> {code:xml}
> <document>
> <entity processor="MailEntityProcessor" user="[email protected]"
> password="something" host="imap.gmail.com" protocol="imaps"/>
> </document>
> {code}
> Below is the list of all available configuration options:
> {color:green}Required{color}
> ---------
> *user*
> *password*
> *protocol* (only "imaps" supported now)
> *host*
> {color:green}Optional{color}
> ---------
> *folders* - comma separated list of folders.
> If not specified, default folder is used. Nested folders can be specified
> like a/b/c
> *recurse* - index subfolders. Defaults to true.
> *exclude* - comma separated list of patterns.
> *include* - comma separated list of patterns.
> *batchSize* - mails to fetch at once in a given folder.
> Only headers can be prefetched in JavaMail IMAP.
> *readTimeout* - defaults to 60000ms
> *connectTimeout* - defaults to 30000ms
> *fetchSize* - IMAP config. 32KB default
> *fetchMailsSince* -
> date/time in "yyyy-MM-dd HH:mm:ss" format; only mails received after this
> date/time are fetched. Useful for delta import.
> *customFilter* - class name.
> {code}
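> import javax.mail.Flags;
> import javax.mail.Folder;
> import javax.mail.search.FlagTerm;
> import javax.mail.search.SearchTerm;
> // MyCustomFilter is a placeholder name for the class given in customFilter
> public class MyCustomFilter implements MailEntityProcessor.CustomFilter {
>   public SearchTerm getCustomSearch(Folder folder) {
>     // example term: select only mails not yet marked as seen
>     return new FlagTerm(new Flags(Flags.Flag.SEEN), false);
>   }
> }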
> *processAttachement* - defaults to true
> The indexed fields are listed below.
> {code}
> // Fields To Index
> // single valued
> private static final String SUBJECT = "subject";
> private static final String FROM = "from";
> private static final String SENT_DATE = "sentDate";
> private static final String XMAILER = "xMailer";
> // multi valued
> private static final String TO_CC_BCC = "allTo";
> private static final String FLAGS = "flags";
> private static final String CONTENT = "content";
> private static final String ATTACHMENT = "attachement";
> private static final String ATTACHMENT_NAMES = "attachementNames";
> // flag values
> private static final String FLAG_ANSWERED = "answered";
> private static final String FLAG_DELETED = "deleted";
> private static final String FLAG_DRAFT = "draft";
> private static final String FLAG_FLAGGED = "flagged";
> private static final String FLAG_RECENT = "recent";
> private static final String FLAG_SEEN = "seen";
> {code}
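For the fetchMailsSince option in the description above, here is a rough sketch of how a value in the "yyyy-MM-dd HH:mm:ss" format could be parsed and compared against a mail's received date. It is an illustration only, not the code in the patch: the class and method names are made up, and the JVM default time zone is assumed since the description does not mention one.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

// Hypothetical helper, not the code in the patch.
final class FetchMailsSince {

    // Format stated in the issue description.
    static final String FORMAT = "yyyy-MM-dd HH:mm:ss";

    // Parses a fetchMailsSince value; rejects values that do not match the format.
    // ASSUMPTION: the value is interpreted in the JVM default time zone.
    static Date parse(String value) {
        SimpleDateFormat format = new SimpleDateFormat(FORMAT);
        format.setLenient(false); // do not roll over out-of-range fields such as month 13
        try {
            return format.parse(value);
        } catch (ParseException e) {
            throw new IllegalArgumentException(
                    "fetchMailsSince must be in " + FORMAT + " format: " + value, e);
        }
    }

    // A mail qualifies only if it was received strictly after the given date/time.
    static boolean shouldFetch(Date received, Date since) {
        return received.after(since);
    }
}
```

For a delta import, the value passed in would typically be the time of the previous import run.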