Quick clarifications:
- Droids: http://incubator.apache.org/droids/index.html
- DIH: http://wiki.apache.org/solr/DataImportHandler
- Solr + Tika: http://wiki.apache.org/solr/ExtractingRequestHandler

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Ben Johnson <[email protected]>
> To: [email protected]
> Sent: Thursday, January 1, 2009 6:00:43 PM
> Subject: Re: [jira] Commented: (SOLR-934) Enable importing of mails into a solr index through DIH.
>
> I'm watching this issue with interest, but I'm having trouble understanding the bigger picture. I am prototyping a system that uses Restlet to store and index objects (mainly MS Office and OpenOffice documents and emails), so I am planning to use Solr with Tika to index the objects.
>
> I know nothing about DIH (Distributed Index Handler?), so I'm not sure what role it plays with Solr. Is it a vendor-specific technology (from Autonomy)? What does it do? Do you give it objects to index and it handles them by passing it to one or more Solr/Tika indexing servers? And are you thinking that this would therefore be a good place to not only index the objects, but also pass the information about the digital content to DROID?
>
> Reading a bit about DROID (from TNA, The National Archives), it seems like it is used to capture information about the digital content of objects stored in a content repository. How does this fit with Solr? I thought Solr with Tika just did the indexing of text-based objects, but the actual storage of the objects would be elsewhere (probably in the file system). From what I can tell, DROID would operate on the file system objects, not the indexing information. Have I got this right?
>
> Ideally, I would also like to convert any suitable content into PDF/A format for long-term archival - probably not relevant to this issue, but I thought I'd mention it in case you see an application of this as part of email and attachment storage.
>
> Sorry for all the questions, but hopefully someone could clarify this for me!
>
> Thanks very much
> Ben Johnson
>
> --------------------------------------------------
> From: "Grant Ingersoll (JIRA)"
> Sent: Thursday, January 01, 2009 7:07 PM
> To:
> Subject: [jira] Commented: (SOLR-934) Enable importing of mails into a solr index through DIH.
>
> > [ https://issues.apache.org/jira/browse/SOLR-934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660210#action_12660210 ]
> >
> > Grant Ingersoll commented on SOLR-934:
> > --------------------------------------
> >
> > Would it make more sense for DIH to farm out its content acquisition to a library like Droids? Then, we could have real crawling, etc. all through a pluggable connector framework.
> >
> >> Enable importing of mails into a solr index through DIH.
> >> --------------------------------------------------------
> >>
> >> Key: SOLR-934
> >> URL: https://issues.apache.org/jira/browse/SOLR-934
> >> Project: Solr
> >> Issue Type: New Feature
> >> Components: contrib - DataImportHandler
> >> Affects Versions: 1.4
> >> Reporter: Preetam Rao
> >> Assignee: Shalin Shekhar Mangar
> >> Fix For: 1.4
> >>
> >> Attachments: SOLR-934.patch, SOLR-934.patch
> >>
> >> Original Estimate: 24h
> >> Remaining Estimate: 24h
> >>
> >> Enable importing of mails into solr through DIH. Take one or more mailbox credentials, download and index their content along with the content from attachments. The folders to fetch can be made configurable based on various criteria. Apache Tika is used for extracting content from different kinds of attachments. JavaMail is used for mail box related operations like fetching mails, filtering them etc.
> >> The basic configuration for one mail box is as below:
> >> {code:xml}
> >>
> >>
> >> password="something" host="imap.gmail.com"
> >> protocol="imaps"/>
> >>
> >> {code}
> >> The below is the list of all configuration available:
> >> {color:green}Required{color}
> >> ---------
> >> *user*
> >> *pwd*
> >> *protocol* (only "imaps" supported now)
> >> *host*
> >> {color:green}Optional{color}
> >> ---------
> >> *folders* - comma separated list of folders.
> >> If not specified, default folder is used. Nested folders can be specified like a/b/c
> >> *recurse* - index subfolders. Defaults to true.
> >> *exclude* - comma separated list of patterns.
> >> *include* - comma separated list of patterns.
> >> *batchSize* - mails to fetch at once in a given folder.
> >> Only headers can be prefetched in Javamail IMAP.
> >> *readTimeout* - defaults to 60000ms
> >> *conectTimeout* - defaults to 30000ms
> >> *fetchSize* - IMAP config. 32KB default
> >> *fetchMailsSince* - date/time in milliseconds; only mails received after it are fetched. Useful for delta import.
> >> *customFilter* - class name.
> >> {code}
> >> import javax.mail.Folder;
> >> import javax.mail.search.SearchTerm;
> >> clz implements MailEntityProcessor.CustomFilter() {
> >>   public SearchTerm getCustomSearch(Folder folder);
> >> }
> >> {code}
> >> *processAttachement* - defaults to true
> >> The below are the indexed fields.
> >> {code}
> >> // Fields To Index
> >> // single valued
> >> private static final String SUBJECT = "subject";
> >> private static final String FROM = "from";
> >> private static final String SENT_DATE = "sentDate";
> >> private static final String XMAILER = "xMailer";
> >> // multi valued
> >> private static final String TO_CC_BCC = "allTo";
> >> private static final String FLAGS = "flags";
> >> private static final String CONTENT = "content";
> >> private static final String ATTACHMENT = "attachement";
> >> private static final String ATTACHMENT_NAMES = "attachementNames";
> >> // flag values
> >> private static final String FLAG_ANSWERED = "answered";
> >> private static final String FLAG_DELETED = "deleted";
> >> private static final String FLAG_DRAFT = "draft";
> >> private static final String FLAG_FLAGGED = "flagged";
> >> private static final String FLAG_RECENT = "recent";
> >> private static final String FLAG_SEEN = "seen";
> >> {code}
> >
> > --
> > This message is automatically generated by JIRA.
> > -
> > You can reply to this email to add a comment to the issue online.
> >
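
A note on the *fetchMailsSince* option from the quoted issue description: it is described there as a date/time in milliseconds, useful for delta import. A minimal sketch of how such a cutoff might be computed is below. It assumes the value is milliseconds since the Unix epoch, which the description does not say explicitly and which has not been checked against the attached patch. The class name FetchMailsSince and the method cutoffMillis are made up for illustration and are not part of DIH or the patch.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper, not part of Solr or the SOLR-934 patch.
final class FetchMailsSince {
    private FetchMailsSince() {}

    // Returns (now - lookback) as milliseconds since the Unix epoch.
    // Assumption: fetchMailsSince expects epoch milliseconds.
    static long cutoffMillis(Instant now, Duration lookback) {
        return now.minus(lookback).toEpochMilli();
    }
}
```

For a delta import covering the last day, one could call FetchMailsSince.cutoffMillis(Instant.now(), Duration.ofDays(1)) and pass the result as the fetchMailsSince value. Taking the current time as a parameter keeps the calculation easy to check with fixed inputs.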
