Re: [jira] Commented: (SOLR-934) Enable importing of mails into a solr index through DIH.

Ben Johnson Thu, 01 Jan 2009 15:01:19 -0800

I'm watching this issue with interest, but I'm having trouble understanding the bigger picture. I am prototyping a system that uses Restlet to store and index objects (mainly MS Office and OpenOffice documents and emails), so I am planning to use Solr with Tika to index the objects.

I know nothing about DIH (Distributed Index Handler?), so I'm not sure what role it plays with Solr. Is it a vendor-specific technology (from Autonomy)? What does it do? Do you give it objects to index and it handles them by passing them to one or more Solr/Tika indexing servers? And are you thinking that this would therefore be a good place to not only index the objects, but also pass the information about the digital content to DROID?

Reading a bit about DROID (from TNA, The National Archives), it seems like it is used to capture information about the digital content of objects stored in a content repository. How does this fit with Solr? I thought Solr with Tika just did the indexing of text-based objects, but the actual storage of the objects would be elsewhere (probably in the file system).

From what I can tell, DROID would operate on the file system objects, not the indexing information. Have I got this right?

Ideally, I would also like to convert any suitable content into PDF/A format for long-term archival - probably not relevant to this issue, but I thought I'd mention it in case you see an application of this as part of email and attachment storage.

Sorry for all the questions, but hopefully someone could clarify this for me!


Thanks very much
Ben Johnson

--------------------------------------------------
From: "Grant Ingersoll (JIRA)" <[email protected]>
Sent: Thursday, January 01, 2009 7:07 PM
To: <[email protected]>

Subject: [jira] Commented: (SOLR-934) Enable importing of mails into a solr index through DIH.

[https://issues.apache.org/jira/browse/SOLR-934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660210#action_12660210 ]


Grant Ingersoll commented on SOLR-934:
--------------------------------------

Would it make more sense for DIH to farm out its content acquisition to a library like Droids? Then, we could have real crawling, etc. all through a pluggable connector framework.

Enable importing of mails into a solr index through DIH.
--------------------------------------------------------

                Key: SOLR-934
                URL: https://issues.apache.org/jira/browse/SOLR-934
            Project: Solr
         Issue Type: New Feature
         Components: contrib - DataImportHandler
   Affects Versions: 1.4
           Reporter: Preetam Rao
           Assignee: Shalin Shekhar Mangar
            Fix For: 1.4

        Attachments: SOLR-934.patch, SOLR-934.patch

  Original Estimate: 24h
 Remaining Estimate: 24h

Enable importing of mails into Solr through DIH. Take one or more mailbox credentials, download and index their content along with the content from attachments. The folders to fetch can be made configurable based on various criteria. Apache Tika is used for extracting content from different kinds of attachments. JavaMail is used for mailbox-related operations like fetching mails, filtering them etc.

The basic configuration for one mail box is as below:
{code:xml}
<document>
   <entity processor="MailEntityProcessor" user="[email protected]"
           password="something" host="imap.gmail.com" protocol="imaps"/>
</document>
{code}
The full list of available configuration options is below:
{color:green}Required{color}
---------
*user*
*pwd*
*protocol*  (only "imaps" supported now)
*host*
{color:green}Optional{color}
---------
*folders* - comma separated list of folders. If not specified, the default folder is used. Nested folders can be specified like a/b/c

*recurse* - index subfolders. Defaults to true.
*exclude* - comma separated list of patterns.
*include* - comma separated list of patterns.
*batchSize* - mails to fetch at once in a given folder. Only headers can be prefetched in JavaMail IMAP.
*readTimeout* - defaults to 60000ms
*connectTimeout* - defaults to 30000ms
*fetchSize* - IMAP config. 32KB default
*fetchMailsSince* - date/time in milliseconds; only mails received after it will be fetched. Useful for delta import.

*customFilter* - class name.
{code}
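import javax.mail.Folder;
import javax.mail.search.SearchTerm;
import javax.mail.search.SubjectTerm;
// MyFilter is a placeholder name for your own filter class.
public class MyFilter implements MailEntityProcessor.CustomFilter {
  public SearchTerm getCustomSearch(Folder folder) {
    return new SubjectTerm("solr"); // example only: match mails by subject
  }
}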
{code}
*processAttachement* - defaults to true
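
As a rough illustration of two of the options above, here is a small stand-alone sketch. It is not taken from the SOLR-934 patch: the class and method names are made up, and the matching rules (include/exclude entries treated as regular expressions on the full folder name, with exclude winning over include) are an assumption the description does not confirm. The first helper just computes a millisecond value of the kind fetchMailsSince expects.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Illustrative helpers only; none of these names come from the SOLR-934 patch.
final class MailImportOptionsSketch {

  // Value for fetchMailsSince: epoch milliseconds of (now - window).
  static long fetchMailsSince(Instant now, Duration window) {
    return now.minus(window).toEpochMilli();
  }

  // ASSUMPTION: include/exclude entries are regular expressions matched against
  // the full folder name, and an exclude match wins over an include match.
  static boolean folderSelected(String folder, List<String> include, List<String> exclude) {
    for (String pattern : exclude) {
      if (folder.matches(pattern)) {
        return false;
      }
    }
    if (include.isEmpty()) {
      return true; // no include list: accept every folder that was not excluded
    }
    for (String pattern : include) {
      if (folder.matches(pattern)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    // Cut-off for a delta import covering the last 24 hours.
    long since = fetchMailsSince(Instant.now(), Duration.ofHours(24));
    System.out.println("fetchMailsSince=\"" + since + "\"");
    System.out.println(folderSelected("lists/solr-dev", List.of("lists/.*"), List.of(".*Trash.*")));
  }
}
```

If the real processor matches patterns differently (for example as prefixes or globs), only folderSelected would need to change.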
The indexed fields are listed below.
{code}
  // Fields To Index
  // single valued
  private static final String SUBJECT = "subject";
  private static final String FROM = "from";
  private static final String SENT_DATE = "sentDate";
  private static final String XMAILER = "xMailer";
  // multi valued
  private static final String TO_CC_BCC = "allTo";
  private static final String FLAGS = "flags";
  private static final String CONTENT = "content";
  private static final String ATTACHMENT = "attachement";
  private static final String ATTACHMENT_NAMES = "attachementNames";
  // flag values
  private static final String FLAG_ANSWERED = "answered";
  private static final String FLAG_DELETED = "deleted";
  private static final String FLAG_DRAFT = "draft";
  private static final String FLAG_FLAGGED = "flagged";
  private static final String FLAG_RECENT = "recent";
  private static final String FLAG_SEEN = "seen";
{code}
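
To make the single-valued versus multi-valued split concrete, the following sketch builds one mail as a plain map keyed by the field names above. It is only an illustration: the class name, the sample values, and the value types (strings, a java.util.Date for sentDate, lists for the multi-valued fields) are assumptions, not something the patch description states.

```java
import java.util.Date;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: shows the rough shape of one indexed mail, not the patch's code.
final class MailDocumentSketch {

  static Map<String, Object> sampleDocument() {
    Map<String, Object> doc = new LinkedHashMap<>();
    // single-valued fields (all values here are made up)
    doc.put("subject", "Re: SOLR-934");
    doc.put("from", "sender@example.com");
    doc.put("sentDate", new Date(0L)); // ASSUMPTION: stored as a date value
    doc.put("xMailer", "ExampleMailer 1.0");
    // multi-valued fields
    doc.put("allTo", List.of("a@example.com", "b@example.com"));
    doc.put("flags", List.of("seen", "answered"));
    doc.put("content", List.of("plain-text body of the mail"));
    doc.put("attachement", List.of("text extracted from the attachment"));
    doc.put("attachementNames", List.of("notes.pdf"));
    return doc;
  }

  public static void main(String[] args) {
    // Print each field with its value to show which ones carry lists.
    sampleDocument().forEach((field, value) -> System.out.println(field + " = " + value));
  }
}
```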


--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
