Hi Ben,
You can take a look at the wiki page for DIH (DataImportHandler):
http://wiki.apache.org/solr/DataImportHandler

It helps you index mostly structured data into Solr from sources such as
databases and XML. It can be considered an ETL tool
(http://en.wikipedia.org/wiki/Extract,_transform,_load) for Solr.

Adding mail support means you can index your emails into Solr with a
few lines of configuration.
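
As a rough illustration of one of those configuration values: the issue
quoted below describes fetchMailsSince as a date/time in milliseconds.
Assuming that means milliseconds since the Unix epoch (the issue does not
say so explicitly), a value could be computed with a small helper like
this. The class and method names are made up for the example and are not
part of DIH.

```java
import java.time.Instant;

// Hypothetical helper, not part of DIH: computes a candidate value for
// fetchMailsSince, assuming it is expressed in milliseconds since the Unix epoch.
class FetchSince {
    // Convert an ISO-8601 UTC timestamp to milliseconds since the Unix epoch.
    static long toEpochMillis(String isoInstant) {
        return Instant.parse(isoInstant).toEpochMilli();
    }

    public static void main(String[] args) {
        // Prints 1230768000000 (midnight UTC on 1 January 2009).
        System.out.println(toEpochMillis("2009-01-01T00:00:00Z"));
    }
}
```

The printed number is what would go into the fetchMailsSince attribute
under that assumption.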
--Noble

On Fri, Jan 2, 2009 at 4:30 AM, Ben Johnson
<[email protected]> wrote:
> I'm watching this issue with interest, but I'm having trouble understanding
> the bigger picture.  I am prototyping a system that uses Restlet to store
> and index objects (mainly MS Office and OpenOffice documents and emails), so
> I am planning to use Solr with Tika to index the objects.
>
> I know nothing about DIH (Distributed Index Handler?), so I'm not sure what
> role it plays with Solr.  Is it a vendor-specific technology (from
> Autonomy)?  What does it do?  Do you give it objects to index and it handles
> them by passing it to one or more Solr/Tika indexing servers?  And are you
> thinking that this would therefore be a good place to not only index the
> objects, but also pass the information about the digital content to DROID?
>
> Reading a bit about DROID (from TNA, The National Archives), it seems like
> it is used to capture information about the digital content of objects
> stored in a content repository.  How does this fit with Solr?  I thought
> Solr with Tika just did the indexing of text-based objects, but the actual
> storage of the objects would be elsewhere (probably in the file system).
> From what I can tell, DROID would operate on the file system objects, not
> the indexing information.  Have I got this right?
>
> Ideally, I would also like to convert any suitable content into PDF/A format
> for long-term archival - probably not relevant to this issue, but I thought
> I'd mention it in case you see an application of this as part of email and
> attachment storage.
>
> Sorry for all the questions, but hopefully someone could clarify this for
> me!
>
> Thanks very much
> Ben Johnson
>
> --------------------------------------------------
> From: "Grant Ingersoll (JIRA)" <[email protected]>
> Sent: Thursday, January 01, 2009 7:07 PM
> To: <[email protected]>
> Subject: [jira] Commented: (SOLR-934) Enable importing of mails into a solr
> index through DIH.
>
>>
>>   [
>> https://issues.apache.org/jira/browse/SOLR-934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660210#action_12660210
>> ]
>>
>> Grant Ingersoll commented on SOLR-934:
>> --------------------------------------
>>
>> Would it make more sense for DIH to farm out its content acquisition to a
>> library like Droids?  Then, we could have real crawling, etc. all through a
>> pluggable connector framework.
>>
>>> Enable importing of mails into a solr index through DIH.
>>> --------------------------------------------------------
>>>
>>>                Key: SOLR-934
>>>                URL: https://issues.apache.org/jira/browse/SOLR-934
>>>            Project: Solr
>>>         Issue Type: New Feature
>>>         Components: contrib - DataImportHandler
>>>   Affects Versions: 1.4
>>>           Reporter: Preetam Rao
>>>           Assignee: Shalin Shekhar Mangar
>>>            Fix For: 1.4
>>>
>>>        Attachments: SOLR-934.patch, SOLR-934.patch
>>>
>>>  Original Estimate: 24h
>>>  Remaining Estimate: 24h
>>>
>>> Enable importing of mails into solr through DIH. Take one or more mailbox
>>> credentials, download and index their content along with the content from
>>> attachments. The folders to fetch can be made configurable based on various
>>> criteria. Apache Tika is used for extracting content from different kinds of
>>> attachments. JavaMail is used for mail box related operations like fetching
>>> mails, filtering them etc.
>>> The basic configuration for one mail box is as below:
>>> {code:xml}
>>> <document>
>>>   <entity processor="MailEntityProcessor" user="[email protected]"
>>>                password="something" host="imap.gmail.com"
>>> protocol="imaps"/>
>>> </document>
>>> {code}
>>> Below is the list of all available configuration options:
>>> {color:green}Required{color}
>>> ---------
>>> *user*
>>> *pwd*
>>> *protocol*  (only "imaps" supported now)
>>> *host*
>>> {color:green}Optional{color}
>>> ---------
>>> *folders* - comma separated list of folders.
>>> If not specified, default folder is used. Nested folders can be specified
>>> like a/b/c
>>> *recurse* - index subfolders. Defaults to true.
>>> *exclude* - comma separated list of patterns.
>>> *include* - comma separated list of patterns.
>>> *batchSize* - mails to fetch at once in a given folder.
>>> Only headers can be prefetched in JavaMail IMAP.
>>> *readTimeout* - defaults to 60000ms
>>> *connectTimeout* - defaults to 30000ms
>>> *fetchSize* - IMAP config. 32KB default
>>> *fetchMailsSince* -
>>> date/time in milliseconds; only mails received after it will be fetched.
>>> Useful for delta import.
>>> *customFilter* - class name.
>>> {code}
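>>> import javax.mail.Folder;
>>> import javax.mail.search.SearchTerm;
>>> // "clz" stands for your own filter class.
>>> public class clz implements MailEntityProcessor.CustomFilter {
>>>   // Return the JavaMail search term used to select mails in the folder.
>>>   public SearchTerm getCustomSearch(Folder folder) {
>>>     throw new UnsupportedOperationException("build your SearchTerm here");
>>>   }
>>> }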
>>> {code}
>>> *processAttachement* - defaults to true
>>> Below are the indexed fields.
>>> {code}
>>>  // Fields To Index
>>>  // single valued
>>>  private static final String SUBJECT = "subject";
>>>  private static final String FROM = "from";
>>>  private static final String SENT_DATE = "sentDate";
>>>  private static final String XMAILER = "xMailer";
>>>  // multi valued
>>>  private static final String TO_CC_BCC = "allTo";
>>>  private static final String FLAGS = "flags";
>>>  private static final String CONTENT = "content";
>>>  private static final String ATTACHMENT = "attachement";
>>>  private static final String ATTACHMENT_NAMES = "attachementNames";
>>>  // flag values
>>>  private static final String FLAG_ANSWERED = "answered";
>>>  private static final String FLAG_DELETED = "deleted";
>>>  private static final String FLAG_DRAFT = "draft";
>>>  private static final String FLAG_FLAGGED = "flagged";
>>>  private static final String FLAG_RECENT = "recent";
>>>  private static final String FLAG_SEEN = "seen";
>>> {code}
>>
>> --
>> This message is automatically generated by JIRA.
>> -
>> You can reply to this email to add a comment to the issue online.
>>
>



-- 
--Noble Paul
