On Fri, Jan 2, 2009 at 6:24 PM, Ben Johnson <[email protected]> wrote:
> Hi Paul
>
> Yes, I was thinking that emails for all users would be indexed into a single
> index, at least conceptually. I'm thinking of a corporate/organisational
> repository that any user could search for relevant information, be that
> email or some other kind of document (e.g. MS Office, OpenOffice, PDF,
> etc...). An example usage would be for government organisations in the
> United Kingdom that need to respond to Freedom of Information (FOI) requests
> and are therefore required by law to produce all information regarding a
> particular subject if requested (sensitive information excluded).
>
> I haven't looked into architectural options for the indexes - I don't know
> if it is possible/desirable to split indexes up and use some sort of
> federated search to produce results, but at least conceptually I was
> thinking of a single source for the indexing information.

A very simple solution is to keep all users in a single index and use an
'fq' (filter query) parameter to limit each search to that user's mail.

>
> Regards
> Ben
>
> --------------------------------------------------
> From: "Noble Paul നോബിള് नोब्ळ्" <[email protected]>
> Sent: Friday, January 02, 2009 11:21 AM
> To: <[email protected]>
> Subject: Re: [jira] Commented: (SOLR-934) Enable importing of mails into a
> solr index through DIH.
>
>> On Fri, Jan 2, 2009 at 3:42 PM, Ben Johnson
>> <[email protected]> wrote:
>>>
>>> Thanks Paul and Preetam. A couple of further things:
>>>
>>> - How do you envisage this functionality being used? I can see indexing
>>> all emails for all users as part of a one-off system setup/migration
>>> process, but also as a core feature to ensure all emails received by a
>>> company/organisation are indexed (and stored). This could be done either
>>> by the end-user, who controls what should be indexed (i.e. certain
>>> work-related emails only) or directly from the mail server, where all
>>> emails would be indexed (including personal emails, which could later be
>>> deleted from the index if desired) to ensure no important emails get
>>> missed. Is this the sort of thing you had in mind? There is also the
>>> issue of not indexing/storing the same email from multiple users'
>>> mailboxes (haven't worked that one out yet, possibly via a hash).
>>>
>>> - Is the mailbox 'configuration' (<entity> tag) stored in data-config.xml
>>> on the Solr server? If so, this would seem to have quite a lot of
>>
>> Do you wish all users' mails to be indexed into a single index? It is
>> possible by passing the username and password as request parameters.
>>
>>
>>> administrative overhead - how do you manage a system with 5000+ users?
>>> How are the accounts/passwords maintained? Are the passwords stored in
>>> plain text?
>>>
>>> - Minor typo: *conectTimeout* should be *connectTimeout*
>>>
>>> - A few real-world scenarios I've encountered are:
>>>   - be able to handle an email sent to over 5000 recipients (in the
>>>     'To:' field)
>>>   - be able to handle an email with a 'long' subject line (240+
>>>     characters)
>>>   - be able to handle an email with 100 attachments
>>>   - be able to handle an email with attachments with 'long' names (240+
>>>     characters)
>>>
>>> These caused several problems in the software I was using at the time (a
>>> proprietary system, not Solr-based), either memory-related issues or file
>>> system errors when running on Windows where the file system or its API
>>> limited file names to 255 characters, including the path.
>>>
>>> Thanks very much!
>>> Ben
>>>
>>> --------------------------------------------------
>>> From: "Noble Paul നോബിള് नोब्ळ्" <[email protected]>
>>> Sent: Friday, January 02, 2009 5:02 AM
>>> To: <[email protected]>
>>> Subject: Re: [jira] Commented: (SOLR-934) Enable importing of mails into
>>> a solr index through DIH.
>>>
>>>> Hi Ben,
>>>> You can take a look at the wiki page for DIH
>>>> http://wiki.apache.org/solr/DataImportHandler
>>>>
>>>> It helps you index mostly structured data into Solr from db, xml etc.
>>>> It can be considered as an ETL tool
>>>> (http://en.wikipedia.org/wiki/Extract,_transform,_load) for Solr.
>>>>
>>>> Adding mail support means you can index your emails into Solr with a
>>>> few lines of configuration
>>>> --Noble
>>>>
>>>> On Fri, Jan 2, 2009 at 4:30 AM, Ben Johnson
>>>> <[email protected]> wrote:
>>>>>
>>>>> I'm watching this issue with interest, but I'm having trouble
>>>>> understanding the bigger picture. I am prototyping a system that uses
>>>>> Restlet to store and index objects (mainly MS Office and OpenOffice
>>>>> documents and emails), so I am planning to use Solr with Tika to index
>>>>> the objects.
>>>>>
>>>>> I know nothing about DIH (Distributed Index Handler?), so I'm not sure
>>>>> what role it plays with Solr. Is it a vendor-specific technology (from
>>>>> Autonomy)? What does it do? Do you give it objects to index and it
>>>>> handles them by passing them to one or more Solr/Tika indexing servers?
>>>>> And are you thinking that this would therefore be a good place to not
>>>>> only index the objects, but also pass the information about the digital
>>>>> content to DROID?
>>>>>
>>>>> Reading a bit about DROID (from TNA, The National Archives), it seems
>>>>> like it is used to capture information about the digital content of
>>>>> objects stored in a content repository. How does this fit with Solr? I
>>>>> thought Solr with Tika just did the indexing of text-based objects, but
>>>>> the actual storage of the objects would be elsewhere (probably in the
>>>>> file system). From what I can tell, DROID would operate on the file
>>>>> system objects, not the indexing information. Have I got this right?
>>>>>
>>>>> Ideally, I would also like to convert any suitable content into PDF/A
>>>>> format for long-term archival - probably not relevant to this issue,
>>>>> but I thought I'd mention it in case you see an application of this as
>>>>> part of email and attachment storage.
>>>>>
>>>>> Sorry for all the questions, but hopefully someone could clarify this
>>>>> for me!
>>>>>
>>>>> Thanks very much
>>>>> Ben Johnson
>>>>>
>>>>> --------------------------------------------------
>>>>> From: "Grant Ingersoll (JIRA)" <[email protected]>
>>>>> Sent: Thursday, January 01, 2009 7:07 PM
>>>>> To: <[email protected]>
>>>>> Subject: [jira] Commented: (SOLR-934) Enable importing of mails into a
>>>>> solr index through DIH.
>>>>>
>>>>>>
>>>>>> [
>>>>>> https://issues.apache.org/jira/browse/SOLR-934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660210#action_12660210
>>>>>> ]
>>>>>>
>>>>>> Grant Ingersoll commented on SOLR-934:
>>>>>> --------------------------------------
>>>>>>
>>>>>> Would it make more sense for DIH to farm out its content acquisition
>>>>>> to a library like Droids? Then, we could have real crawling, etc. all
>>>>>> through a pluggable connector framework.
>>>>>>
>>>>>>> Enable importing of mails into a solr index through DIH.
>>>>>>> --------------------------------------------------------
>>>>>>>
>>>>>>> Key: SOLR-934
>>>>>>> URL: https://issues.apache.org/jira/browse/SOLR-934
>>>>>>> Project: Solr
>>>>>>> Issue Type: New Feature
>>>>>>> Components: contrib - DataImportHandler
>>>>>>> Affects Versions: 1.4
>>>>>>> Reporter: Preetam Rao
>>>>>>> Assignee: Shalin Shekhar Mangar
>>>>>>> Fix For: 1.4
>>>>>>>
>>>>>>> Attachments: SOLR-934.patch, SOLR-934.patch
>>>>>>>
>>>>>>> Original Estimate: 24h
>>>>>>> Remaining Estimate: 24h
>>>>>>>
>>>>>>> Enable importing of mails into solr through DIH. Take one or more
>>>>>>> mailbox credentials, download and index their content along with the
>>>>>>> content from attachments. The folders to fetch can be made
>>>>>>> configurable based on various criteria. Apache Tika is used for
>>>>>>> extracting content from different kinds of attachments. JavaMail is
>>>>>>> used for mail box related operations like fetching mails, filtering
>>>>>>> them etc.
>>>>>>> The basic configuration for one mail box is as below:
>>>>>>> {code:xml}
>>>>>>> <document>
>>>>>>>   <entity processor="MailEntityProcessor" user="[email protected]"
>>>>>>>           password="something" host="imap.gmail.com"
>>>>>>>           protocol="imaps"/>
>>>>>>> </document>
>>>>>>> {code}
>>>>>>> The below is the list of all configuration available:
>>>>>>> {color:green}Required{color}
>>>>>>> ---------
>>>>>>> *user*
>>>>>>> *pwd*
>>>>>>> *protocol* (only "imaps" supported now)
>>>>>>> *host*
>>>>>>> {color:green}Optional{color}
>>>>>>> ---------
>>>>>>> *folders* - comma separated list of folders.
>>>>>>> If not specified, default folder is used. Nested folders can be
>>>>>>> specified like a/b/c
>>>>>>> *recurse* - index subfolders. Defaults to true.
>>>>>>> *exclude* - comma separated list of patterns.
>>>>>>> *include* - comma separated list of patterns.
>>>>>>> *batchSize* - mails to fetch at once in a given folder.
>>>>>>> Only headers can be prefetched in Javamail IMAP.
>>>>>>> *readTimeout* - defaults to 60000ms
>>>>>>> *conectTimeout* - defaults to 30000ms
>>>>>>> *fetchSize* - IMAP config. 32KB default
>>>>>>> *fetchMailsSince* -
>>>>>>> date/time in milliseconds, mails received after which will be
>>>>>>> fetched. Useful for delta import.
>>>>>>> *customFilter* - class name.
>>>>>>> {code}
>>>>>>> import javax.mail.Folder;
>>>>>>> import javax.mail.search.SearchTerm;
>>>>>>> // clz is the class named in customFilter
>>>>>>> clz implements MailEntityProcessor.CustomFilter {
>>>>>>>   public SearchTerm getCustomSearch(Folder folder);
>>>>>>> }
>>>>>>> {code}
>>>>>>> *processAttachement* - defaults to true
>>>>>>> The below are the indexed fields.
>>>>>>> {code}
>>>>>>> // Fields To Index
>>>>>>> // single valued
>>>>>>> private static final String SUBJECT = "subject";
>>>>>>> private static final String FROM = "from";
>>>>>>> private static final String SENT_DATE = "sentDate";
>>>>>>> private static final String XMAILER = "xMailer";
>>>>>>> // multi valued
>>>>>>> private static final String TO_CC_BCC = "allTo";
>>>>>>> private static final String FLAGS = "flags";
>>>>>>> private static final String CONTENT = "content";
>>>>>>> private static final String ATTACHMENT = "attachement";
>>>>>>> private static final String ATTACHMENT_NAMES = "attachementNames";
>>>>>>> // flag values
>>>>>>> private static final String FLAG_ANSWERED = "answered";
>>>>>>> private static final String FLAG_DELETED = "deleted";
>>>>>>> private static final String FLAG_DRAFT = "draft";
>>>>>>> private static final String FLAG_FLAGGED = "flagged";
>>>>>>> private static final String FLAG_RECENT = "recent";
>>>>>>> private static final String FLAG_SEEN = "seen";
>>>>>>> {code}
>>>>>>
>>>>>> --
>>>>>> This message is automatically generated by JIRA.
>>>>>> -
>>>>>> You can reply to this email to add a comment to the issue online.
>>>>>>
>>>>>
>>>>
>>>> --
>>>> --Noble Paul
>>>
>>
>> --
>> --Noble Paul
>>
>
--
--Noble Paul
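
PS: To make the 'fq' suggestion at the top of this mail a bit more concrete,
here is a rough sketch of how a client might build such a request. It is only
an illustration under assumptions: the field name "owner" is hypothetical
(the indexed fields listed in the issue do not include a per-user field, so
one would have to be added at index time), and the class and method names are
made up for this example. Only q and fq are real Solr request parameters.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class FilterQueryExample {

    // Hypothetical field holding the mailbox owner; not part of the SOLR-934 patch.
    static final String OWNER_FIELD = "owner";

    /** Builds the query string for a search restricted to one user's mail. */
    static String buildQuery(String userQuery, String user) {
        // Quote the user name as a phrase, escaping backslashes and double quotes.
        String quoted = "\"" + user.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
        String fq = OWNER_FIELD + ":" + quoted;
        // q carries what the user typed; fq narrows the result set to that user.
        return "q=" + URLEncoder.encode(userQuery, StandardCharsets.UTF_8)
                + "&fq=" + URLEncoder.encode(fq, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Prints the query string to append to the select handler URL.
        System.out.println(buildQuery("freedom of information", "alice"));
    }
}
```

The resulting string would be appended to the select URL of the Solr server.
Keeping the restriction in fq rather than folding it into q means the user's
own query text stays untouched, and Solr can cache the filter separately.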
