Hi Ben,

You can take a look at the wiki page for DIH (DataImportHandler):
http://wiki.apache.org/solr/DataImportHandler

It helps you index mostly structured data into Solr from databases, XML
files, etc. It can be considered an ETL tool
(http://en.wikipedia.org/wiki/Extract,_transform,_load) for Solr.

Adding mail support means you can index your emails into Solr with a few
lines of configuration.

--Noble

On Fri, Jan 2, 2009 at 4:30 AM, Ben Johnson <[email protected]> wrote:
> I'm watching this issue with interest, but I'm having trouble understanding
> the bigger picture. I am prototyping a system that uses Restlet to store
> and index objects (mainly MS Office and OpenOffice documents and emails), so
> I am planning to use Solr with Tika to index the objects.
>
> I know nothing about DIH (Distributed Index Handler?), so I'm not sure what
> role it plays with Solr. Is it a vendor-specific technology (from
> Autonomy)? What does it do? Do you give it objects to index and it handles
> them by passing it to one or more Solr/Tika indexing servers? And are you
> thinking that this would therefore be a good place to not only index the
> objects, but also pass the information about the digital content to DROID?
>
> Reading a bit about DROID (from TNA, The National Archives), it seems like
> it is used to capture information about the digital content of objects
> stored in a content repository. How does this fit with Solr? I thought
> Solr with Tika just did the indexing of text-based objects, but the actual
> storage of the objects would be elsewhere (probably in the file system).
> From what I can tell, DROID would operate on the file system objects, not
> the indexing information. Have I got this right?
>
> Ideally, I would also like to convert any suitable content into PDF/A format
> for long-term archival - probably not relevant to this issue, but I thought
> I'd mention it in case you see an application of this as part of email and
> attachment storage.
>
> Sorry for all the questions, but hopefully someone could clarify this for
> me!
>
> Thanks very much
> Ben Johnson
>
> --------------------------------------------------
> From: "Grant Ingersoll (JIRA)" <[email protected]>
> Sent: Thursday, January 01, 2009 7:07 PM
> To: <[email protected]>
> Subject: [jira] Commented: (SOLR-934) Enable importing of mails into a solr
> index through DIH.
>
>>
>> [
>> https://issues.apache.org/jira/browse/SOLR-934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660210#action_12660210
>> ]
>>
>> Grant Ingersoll commented on SOLR-934:
>> --------------------------------------
>>
>> Would it make more sense for DIH to farm out its content acquisition to a
>> library like Droids? Then, we could have real crawling, etc. all through a
>> pluggable connector framework.
>>
>>> Enable importing of mails into a solr index through DIH.
>>> --------------------------------------------------------
>>>
>>> Key: SOLR-934
>>> URL: https://issues.apache.org/jira/browse/SOLR-934
>>> Project: Solr
>>> Issue Type: New Feature
>>> Components: contrib - DataImportHandler
>>> Affects Versions: 1.4
>>> Reporter: Preetam Rao
>>> Assignee: Shalin Shekhar Mangar
>>> Fix For: 1.4
>>>
>>> Attachments: SOLR-934.patch, SOLR-934.patch
>>>
>>> Original Estimate: 24h
>>> Remaining Estimate: 24h
>>>
>>> Enable importing of mails into solr through DIH. Take one or more mailbox
>>> credentials, download and index their content along with the content from
>>> attachments. The folders to fetch can be made configurable based on various
>>> criteria. Apache Tika is used for extracting content from different kinds of
>>> attachments. JavaMail is used for mail box related operations like fetching
>>> mails, filtering them etc.
>>> The basic configuration for one mail box is as below:
>>> {code:xml}
>>> <document>
>>>   <entity processor="MailEntityProcessor" user="[email protected]"
>>>           password="something" host="imap.gmail.com"
>>>           protocol="imaps"/>
>>> </document>
>>> {code}
>>> The below is the list of all configuration available:
>>> {color:green}Required{color}
>>> ---------
>>> *user*
>>> *pwd*
>>> *protocol* (only "imaps" supported now)
>>> *host*
>>> {color:green}Optional{color}
>>> ---------
>>> *folders* - comma separated list of folders.
>>> If not specified, default folder is used. Nested folders can be specified
>>> like a/b/c
>>> *recurse* - index subfolders. Defaults to true.
>>> *exclude* - comma separated list of patterns.
>>> *include* - comma separated list of patterns.
>>> *batchSize* - mails to fetch at once in a given folder.
>>> Only headers can be prefetched in Javamail IMAP.
>>> *readTimeout* - defaults to 60000ms
>>> *conectTimeout* - defaults to 30000ms
>>> *fetchSize* - IMAP config. 32KB default
>>> *fetchMailsSince* -
>>> date/time in milliseconds, mails received after which will be fetched.
>>> Useful for delta import.
>>> *customFilter* - class name.
>>> {code}
>>> import javax.mail.Folder;
>>> import javax.mail.search.SearchTerm;
>>> clz implements MailEntityProcessor.CustomFilter {
>>>   public SearchTerm getCustomSearch(Folder folder);
>>> }
>>> {code}
>>> *processAttachement* - defaults to true
>>> The below are the indexed fields.
>>> {code}
>>> // Fields To Index
>>> // single valued
>>> private static final String SUBJECT = "subject";
>>> private static final String FROM = "from";
>>> private static final String SENT_DATE = "sentDate";
>>> private static final String XMAILER = "xMailer";
>>> // multi valued
>>> private static final String TO_CC_BCC = "allTo";
>>> private static final String FLAGS = "flags";
>>> private static final String CONTENT = "content";
>>> private static final String ATTACHMENT = "attachement";
>>> private static final String ATTACHMENT_NAMES = "attachementNames";
>>> // flag values
>>> private static final String FLAG_ANSWERED = "answered";
>>> private static final String FLAG_DELETED = "deleted";
>>> private static final String FLAG_DRAFT = "draft";
>>> private static final String FLAG_FLAGGED = "flagged";
>>> private static final String FLAG_RECENT = "recent";
>>> private static final String FLAG_SEEN = "seen";
>>> {code}
>>
>> --
>> This message is automatically generated by JIRA.
>> -
>> You can reply to this email to add a comment to the issue online.
>>
>

--
--Noble Paul
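
To make the "few lines of configuration" a bit more concrete, here is a minimal sketch of how the fetchMailsSince option from the quoted issue might be filled in for a delta import. It assumes, going only by the issue description, that the value is a timestamp in epoch milliseconds. The class MailEntityConfigSketch and its methods are hypothetical helpers and not part of DIH; the user, password and host values are placeholders. Only the attribute names come from the issue text.

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of DIH: builds a MailEntityProcessor
// <entity/> element with a fetchMailsSince cutoff for a delta import.
class MailEntityConfigSketch {

    // Convert a UTC date/time to epoch milliseconds, the unit the quoted
    // description gives for fetchMailsSince (an assumption taken from it).
    static long toEpochMillis(LocalDateTime utc) {
        return utc.toInstant(ZoneOffset.UTC).toEpochMilli();
    }

    // Render attributes in insertion order. Values are assumed to contain
    // no XML special characters; a real config would need escaping.
    static String entityElement(Map<String, String> attrs) {
        StringBuilder sb = new StringBuilder("<entity");
        for (Map.Entry<String, String> e : attrs.entrySet()) {
            sb.append(' ').append(e.getKey())
              .append("=\"").append(e.getValue()).append('"');
        }
        return sb.append("/>").toString();
    }

    // Assemble the attributes used in the issue's basic example, plus
    // fetchMailsSince. All values here are placeholders.
    static String deltaImportEntity(LocalDateTime sinceUtc) {
        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put("processor", "MailEntityProcessor");
        attrs.put("user", "someone@example.com");   // placeholder
        attrs.put("password", "something");         // placeholder
        attrs.put("host", "imap.gmail.com");
        attrs.put("protocol", "imaps");
        attrs.put("fetchMailsSince", Long.toString(toEpochMillis(sinceUtc)));
        return entityElement(attrs);
    }

    public static void main(String[] args) {
        // Print an entity that would fetch mails received after 2009-01-01 UTC.
        System.out.println(deltaImportEntity(LocalDateTime.of(2009, 1, 1, 0, 0)));
    }
}
```

The printed element would go inside the <document> element of the DIH configuration. Whether the patch expects milliseconds in exactly this form should be checked against the attached SOLR-934.patch.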
