Quick clarifications:

- Droids: http://incubator.apache.org/droids/index.html
- DIH: http://wiki.apache.org/solr/DataImportHandler
- Solr + Tika: http://wiki.apache.org/solr/ExtractingRequestHandler


Otis 
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: Ben Johnson <[email protected]>
> To: [email protected]
> Sent: Thursday, January 1, 2009 6:00:43 PM
> Subject: Re: [jira] Commented: (SOLR-934) Enable importing of mails into a 
> solr index through DIH.
> 
> I'm watching this issue with interest, but I'm having trouble understanding
> the bigger picture.  I am prototyping a system that uses Restlet to store and
> index objects (mainly MS Office and OpenOffice documents and emails), so I am
> planning to use Solr with Tika to index the objects.
> 
> I know nothing about DIH (Distributed Index Handler?), so I'm not sure what
> role it plays with Solr.  Is it a vendor-specific technology (from Autonomy)?
> What does it do?  Do you give it objects to index and it handles them by
> passing them to one or more Solr/Tika indexing servers?  And are you thinking
> that this would therefore be a good place to not only index the objects, but
> also pass the information about the digital content to DROID?
> 
> Reading a bit about DROID (from TNA, The National Archives), it seems like it
> is used to capture information about the digital content of objects stored in
> a content repository.  How does this fit with Solr?  I thought Solr with Tika
> just did the indexing of text-based objects, but the actual storage of the
> objects would be elsewhere (probably in the file system).  From what I can
> tell, DROID would operate on the file system objects, not the indexing
> information.  Have I got this right?
> 
> Ideally, I would also like to convert any suitable content into PDF/A format
> for long-term archival - probably not relevant to this issue, but I thought
> I'd mention it in case you see an application of this as part of email and
> attachment storage.
> 
> Sorry for all the questions, but hopefully someone could clarify this for me!
> 
> Thanks very much
> Ben Johnson
> 
> --------------------------------------------------
> From: "Grant Ingersoll (JIRA)" 
> Sent: Thursday, January 01, 2009 7:07 PM
> To: 
> Subject: [jira] Commented: (SOLR-934) Enable importing of mails into a solr 
> index through DIH.
> 
> > 
> >    [ https://issues.apache.org/jira/browse/SOLR-934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660210#action_12660210 ]
> > 
> > Grant Ingersoll commented on SOLR-934:
> > --------------------------------------
> > 
> > Would it make more sense for DIH to farm out its content acquisition to a
> > library like Droids?  Then, we could have real crawling, etc. all through a
> > pluggable connector framework.
> > 
> >> Enable importing of mails into a solr index through DIH.
> >> --------------------------------------------------------
> >> 
> >>                 Key: SOLR-934
> >>                 URL: https://issues.apache.org/jira/browse/SOLR-934
> >>             Project: Solr
> >>          Issue Type: New Feature
> >>          Components: contrib - DataImportHandler
> >>    Affects Versions: 1.4
> >>            Reporter: Preetam Rao
> >>            Assignee: Shalin Shekhar Mangar
> >>             Fix For: 1.4
> >> 
> >>         Attachments: SOLR-934.patch, SOLR-934.patch
> >> 
> >>   Original Estimate: 24h
> >>  Remaining Estimate: 24h
> >> 
> >> Enable importing of mails into solr through DIH. Take one or more mailbox
> >> credentials, download and index their content along with the content from
> >> attachments. The folders to fetch can be made configurable based on various
> >> criteria. Apache Tika is used for extracting content from different kinds of
> >> attachments. JavaMail is used for mail box related operations like fetching
> >> mails, filtering them etc.
> >> The basic configuration for one mail box is as below:
> >> {code:xml}
> >> 
> >>    
> >>                 password="something" host="imap.gmail.com" 
> >> protocol="imaps"/>
> >> 
> >> {code}
> >> Below is the list of all available configuration options:
> >> {color:green}Required{color}
> >> ---------
> >> *user*
> >> *pwd*
> >> *protocol*  (only "imaps" supported now)
> >> *host*
> >> {color:green}Optional{color}
> >> ---------
> >> *folders* - comma separated list of folders.
> >> If not specified, default folder is used. Nested folders can be specified like a/b/c
> >> *recurse* - index subfolders. Defaults to true.
> >> *exclude* - comma separated list of patterns.
> >> *include* - comma separated list of patterns.
> >> *batchSize* - mails to fetch at once in a given folder.
> >> Only headers can be prefetched in Javamail IMAP.
> >> *readTimeout* - defaults to 60000ms
> >> *conectTimeout* - defaults to 30000ms
> >> *fetchSize* - IMAP config. 32KB default
> >> *fetchMailsSince* -
> >> date/time in milliseconds; only mails received after it will be fetched.
> >> Useful for delta import.
> >> *customFilter* - class name.
> >> {code}
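> >> import javax.mail.Flags;
> >> import javax.mail.Folder;
> >> import javax.mail.search.FlagTerm;
> >> import javax.mail.search.SearchTerm;
> >> // clz: your own class (named in customFilter); the body is only an example
> >> public class clz implements MailEntityProcessor.CustomFilter {
> >>   public SearchTerm getCustomSearch(Folder folder) {
> >>     return new FlagTerm(new Flags(Flags.Flag.SEEN), false); // unseen mails
> >>   }
> >> }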
> >> {code}
> >> *processAttachement* - defaults to true
> >> The indexed fields are listed below.
> >> {code}
> >>   // Fields To Index
> >>   // single valued
> >>   private static final String SUBJECT = "subject";
> >>   private static final String FROM = "from";
> >>   private static final String SENT_DATE = "sentDate";
> >>   private static final String XMAILER = "xMailer";
> >>   // multi valued
> >>   private static final String TO_CC_BCC = "allTo";
> >>   private static final String FLAGS = "flags";
> >>   private static final String CONTENT = "content";
> >>   private static final String ATTACHMENT = "attachement";
> >>   private static final String ATTACHMENT_NAMES = "attachementNames";
> >>   // flag values
> >>   private static final String FLAG_ANSWERED = "answered";
> >>   private static final String FLAG_DELETED = "deleted";
> >>   private static final String FLAG_DRAFT = "draft";
> >>   private static final String FLAG_FLAGGED = "flagged";
> >>   private static final String FLAG_RECENT = "recent";
> >>   private static final String FLAG_SEEN = "seen";
> >> {code}
> > 
> > -- This message is automatically generated by JIRA.
> > -
> > You can reply to this email to add a comment to the issue online.
> > 
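
For anyone wondering how the quoted *include* and *exclude* options might
behave, here is a rough sketch in plain Java. It is not taken from the
SOLR-934 patch. The class name FolderFilter, the method accepts, and the
semantics are all assumptions made for illustration: patterns are treated as
Java regular expressions, an exclude match always wins, and an empty include
list accepts every folder that is not excluded.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of the SOLR-934 patch. */
final class FolderFilter {
    private final List<Pattern> include;
    private final List<Pattern> exclude;

    FolderFilter(String includeCsv, String excludeCsv) {
        this.include = parse(includeCsv);
        this.exclude = parse(excludeCsv);
    }

    /** Splits a comma separated attribute value into compiled regexes. */
    private static List<Pattern> parse(String csv) {
        List<Pattern> patterns = new ArrayList<>();
        if (csv == null) {
            return patterns;
        }
        for (String part : csv.split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                patterns.add(Pattern.compile(trimmed));
            }
        }
        return patterns;
    }

    /** Assumed semantics: exclude wins; an empty include list accepts the rest. */
    boolean accepts(String folderName) {
        for (Pattern p : exclude) {
            if (p.matcher(folderName).matches()) {
                return false;
            }
        }
        if (include.isEmpty()) {
            return true;
        }
        for (Pattern p : include) {
            if (p.matcher(folderName).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        FolderFilter filter = new FolderFilter("inbox, lists/.*", ".*/spam");
        String[] names = {"inbox", "lists/solr-dev", "lists/spam", "drafts"};
        for (String name : names) {
            System.out.println(name + " -> " + filter.accepts(name));
        }
    }
}
```

The actual patch may match patterns differently (for example against the
full folder path, or with include taking precedence), so check the attached
SOLR-934.patch before relying on this behaviour.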
