Re: [jira] Commented: (SOLR-934) Enable importing of mails into a solr index through DIH.

Noble Paul നോബിള്‍ नोब्ळ् Fri, 02 Jan 2009 21:52:53 -0800

On Fri, Jan 2, 2009 at 6:24 PM, Ben Johnson
<[email protected]> wrote:
> Hi Paul
>
> Yes, I was thinking that emails for all users would be indexed into a single
> index, at least conceptually.  I'm thinking of a corporate/organisational
> repository that any user could search for relevant information, be that
> email or some other kind of document (e.g. MS Office, OpenOffice, PDF,
> etc...).  An example usage would be for government organisations in the
> United Kingdom that need to respond to Freedom of Information (FOI) requests
> and are therefore required by law to produce all information regarding a
> particular subject if requested (sensitive information excluded).
>
> I haven't looked into architectural options for the indexes - I don't know
> if it is possible/desirable to split indexes up and use some sort of
> federated search to produce results, but at least conceptually I was
> thinking of a single source for the indexing information.


A very simple solution is to keep all users in a single index and use an
'fq' (filter query) to limit each search to that user's mails.
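
As a rough sketch of the 'fq' idea, assume every mail document carries a field
naming the mailbox owner. The field name "owner", the class name and the base
URL below are all hypothetical placeholders, not something defined by the
SOLR-934 patch. The search request then simply adds a filter query on that field:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class UserFilteredQuery {
    // Builds a Solr select URL whose results are restricted to one user via 'fq'.
    // "owner" is an assumed field name; baseUrl is a placeholder.
    public static String build(String baseUrl, String userQuery, String user) {
        // Escape backslashes and quotes, then wrap the value in quotes so that
        // characters such as '@' or spaces stay inside a single term.
        String escaped = user.replace("\\", "\\\\").replace("\"", "\\\"");
        String filter = "owner:\"" + escaped + "\"";
        return baseUrl + "/select?q=" + URLEncoder.encode(userQuery, StandardCharsets.UTF_8)
                + "&fq=" + URLEncoder.encode(filter, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(build("http://localhost:8983/solr", "subject:budget", "ben"));
    }
}
```

The main query stays whatever the user typed, and the per-user restriction
travels separately in 'fq', so the same index can serve every user.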

>
> Regards
> Ben
>
> --------------------------------------------------
> From: "Noble Paul നോബിള്‍ नोब्ळ्" <[email protected]>
> Sent: Friday, January 02, 2009 11:21 AM
> To: <[email protected]>
> Subject: Re: [jira] Commented: (SOLR-934) Enable importing of mails into a
> solr index through DIH.
>
>> On Fri, Jan 2, 2009 at 3:42 PM, Ben Johnson
>> <[email protected]> wrote:
>>>
>>> Thanks Paul and Preetam.  A couple of further things:
>>>
>>> - How do you envisage this functionality being used?  I can see indexing
>>> all
>>> emails for all users as part of a one-off system setup/migration process,
>>> but also as a core feature to ensure all emails received by a
>>> company/organisation are indexed (and stored).  This could be done either
>>> by
>>> the end-user, who controls what should be indexed (i.e. certain
>>> work-related
>>> emails only) or directly from the mail server, where all emails would be
>>> indexed (including personal emails, which could later be deleted from the
>>> index if desired) to ensure no important emails get missed.  Is this the
>>> sort of thing you had in mind?  There is also the issue of not
>>> indexing/storing the same email from multiple users' mailboxes (haven't
>>> worked that one out yet, possibly via a hash).
>>>
>>> - Is the mailbox 'configuration' (<entity> tag) stored in data-config.xml
>>> on
>>> the Solr server?  If so, this would seem to have quite a lot of
>>
>> Do you wish all users' mails to be indexed into a single index? It is
>> possible by passing the username and password as request parameters.
>>
>>
>>> administrative overhead - how do you manage a system with 5000+ users?
>>> How
>>> are the accounts/passwords maintained?  Are the passwords stored in plain
>>> text?
>>>
>>> - Minor typo: *conectTimeout* should be *connectTimeout*
>>>
>>> - A few real-world scenarios I've encountered are:
>>>  - be able to handle an email sent to over 5000 recipients (in the 'To:'
>>> field)
>>>  - be able to handle an email with a 'long' subject line (240+
>>> characters)
>>>  - be able to handle an email with 100 attachments
>>>  - be able to handle an email with attachments with 'long' names (240+
>>> characters)
>>>
>>> This caused several problems in the software I was using at the time (a
>>> proprietary system, not Solr-based), either memory-related issues or file
>>> system errors when running on Windows where the file system or its API
>>> limited file names to 255 characters, including the path.
>>>
>>> Thanks very much!
>>> Ben
>>>
>>> --------------------------------------------------
>>> From: "Noble Paul നോബിള്‍ नोब्ळ्" <[email protected]>
>>> Sent: Friday, January 02, 2009 5:02 AM
>>> To: <[email protected]>
>>> Subject: Re: [jira] Commented: (SOLR-934) Enable importing of mails into
>>> a
>>> solr index through DIH.
>>>
>>>> Hi Ben,
>>>> You can take a look at the wiki page for DIH
>>>> http://wiki.apache.org/solr/DataImportHandler
>>>>
>>>> It helps you index mostly structured data into Solr from a DB, XML, etc.
>>>> It can be considered an ETL tool
>>>> (http://en.wikipedia.org/wiki/Extract,_transform,_load) for Solr.
>>>>
>>>> Adding mail support means you can index your emails into Solr with a
>>>> few lines of configuration
>>>> --Noble
>>>>
>>>> On Fri, Jan 2, 2009 at 4:30 AM, Ben Johnson
>>>> <[email protected]> wrote:
>>>>>
>>>>> I'm watching this issue with interest, but I'm having trouble
>>>>> understanding
>>>>> the bigger picture.  I am prototyping a system that uses Restlet to
>>>>> store
>>>>> and index objects (mainly MS Office and OpenOffice documents and
>>>>> emails),
>>>>> so
>>>>> I am planning to use Solr with Tika to index the objects.
>>>>>
>>>>> I know nothing about DIH (Distributed Index Handler?), so I'm not sure
>>>>> what
>>>>> role it plays with Solr.  Is it a vendor-specific technology (from
>>>>> Autonomy)?  What does it do?  Do you give it objects to index and it
>>>>> handles
>>>>> them by passing it to one or more Solr/Tika indexing servers?  And are
>>>>> you
>>>>> thinking that this would therefore be a good place to not only index
>>>>> the
>>>>> objects, but also pass the information about the digital content to
>>>>> DROID?
>>>>>
>>>>> Reading a bit about DROID (from TNA, The National Archives), it seems
>>>>> like
>>>>> it is used to capture information about the digital content of objects
>>>>> stored in a content repository.  How does this fit with Solr?  I
>>>>> thought
>>>>> Solr with Tika just did the indexing of text-based objects, but the
>>>>> actual
>>>>> storage of the objects would be elsewhere (probably in the file
>>>>> system).
>>>>> From what I can tell, DROID would operate on the file system objects,
>>>>> not
>>>>> the indexing information.  Have I got this right?
>>>>>
>>>>> Ideally, I would also like to convert any suitable content into PDF/A
>>>>> format
>>>>> for long-term archival - probably not relevant to this issue, but I
>>>>> thought
>>>>> I'd mention it in case you see an application of this as part of email
>>>>> and
>>>>> attachment storage.
>>>>>
>>>>> Sorry for all the questions, but hopefully someone could clarify this
>>>>> for
>>>>> me!
>>>>>
>>>>> Thanks very much
>>>>> Ben Johnson
>>>>>
>>>>> --------------------------------------------------
>>>>> From: "Grant Ingersoll (JIRA)" <[email protected]>
>>>>> Sent: Thursday, January 01, 2009 7:07 PM
>>>>> To: <[email protected]>
>>>>> Subject: [jira] Commented: (SOLR-934) Enable importing of mails into a
>>>>> solr
>>>>> index through DIH.
>>>>>
>>>>>>
>>>>>>  [
>>>>>>
>>>>>>
>>>>>> https://issues.apache.org/jira/browse/SOLR-934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660210#action_12660210
>>>>>> ]
>>>>>>
>>>>>> Grant Ingersoll commented on SOLR-934:
>>>>>> --------------------------------------
>>>>>>
>>>>>> Would it make more sense for DIH to farm out its content acquisition
>>>>>> to
>>>>>> a
>>>>>> library like Droids?  Then, we could have real crawling, etc. all
>>>>>> through a
>>>>>> pluggable connector framework.
>>>>>>
>>>>>>> Enable importing of mails into a solr index through DIH.
>>>>>>> --------------------------------------------------------
>>>>>>>
>>>>>>>              Key: SOLR-934
>>>>>>>              URL: https://issues.apache.org/jira/browse/SOLR-934
>>>>>>>          Project: Solr
>>>>>>>       Issue Type: New Feature
>>>>>>>       Components: contrib - DataImportHandler
>>>>>>>  Affects Versions: 1.4
>>>>>>>         Reporter: Preetam Rao
>>>>>>>         Assignee: Shalin Shekhar Mangar
>>>>>>>          Fix For: 1.4
>>>>>>>
>>>>>>>      Attachments: SOLR-934.patch, SOLR-934.patch
>>>>>>>
>>>>>>>  Original Estimate: 24h
>>>>>>>  Remaining Estimate: 24h
>>>>>>>
>>>>>>> Enable importing of mails into solr through DIH. Take one or more
>>>>>>> mailbox
>>>>>>> credentials, download and index their content along with the content
>>>>>>> from
>>>>>>> attachments. The folders to fetch can be made configurable based on
>>>>>>> various
>>>>>>> criteria. Apache Tika is used for extracting content from different
>>>>>>> kinds of
>>>>>>> attachments. JavaMail is used for mail box related operations like
>>>>>>> fetching
>>>>>>> mails, filtering them etc.
>>>>>>> The basic configuration for one mail box is as below:
>>>>>>> {code:xml}
>>>>>>> <document>
>>>>>>>  <entity processor="MailEntityProcessor" user="[email protected]"
>>>>>>>              password="something" host="imap.gmail.com"
>>>>>>> protocol="imaps"/>
>>>>>>> </document>
>>>>>>> {code}
>>>>>>> The below is the list of all configuration available:
>>>>>>> {color:green}Required{color}
>>>>>>> ---------
>>>>>>> *user*
>>>>>>> *pwd*
>>>>>>> *protocol*  (only "imaps" supported now)
>>>>>>> *host*
>>>>>>> {color:green}Optional{color}
>>>>>>> ---------
>>>>>>> *folders* - comma separated list of folders.
>>>>>>> If not specified, default folder is used. Nested folders can be
>>>>>>> specified
>>>>>>> like a/b/c
>>>>>>> *recurse* - index subfolders. Defaults to true.
>>>>>>> *exclude* - comma separated list of patterns.
>>>>>>> *include* - comma separated list of patterns.
>>>>>>> *batchSize* - mails to fetch at once in a given folder.
>>>>>>> Only headers can be prefetched in JavaMail IMAP.
>>>>>>> *readTimeout* - defaults to 60000ms
>>>>>>> *conectTimeout* - defaults to 30000ms
>>>>>>> *fetchSize* - IMAP config. 32KB default
>>>>>>> *fetchMailsSince* -
>>>>>>> date/time in milliseconds, mails received after which will be
>>>>>>> fetched.
>>>>>>> Useful for delta import.
>>>>>>> *customFilter* - class name.
>>>>>>> {code}
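>>>>>>> import javax.mail.Folder;
>>>>>>> import javax.mail.search.SearchTerm;
>>>>>>> // "clz" stands for the class named in customFilter
>>>>>>> public class clz implements MailEntityProcessor.CustomFilter {
>>>>>>>   public SearchTerm getCustomSearch(Folder folder) {
>>>>>>>     // placeholder body: build and return the SearchTerm for this folder
>>>>>>>     throw new UnsupportedOperationException("not implemented");
>>>>>>>   }
>>>>>>> }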
>>>>>>> {code}
>>>>>>> *processAttachement* - defaults to true
>>>>>>> The below are the indexed fields.
>>>>>>> {code}
>>>>>>>  // Fields To Index
>>>>>>>  // single valued
>>>>>>>  private static final String SUBJECT = "subject";
>>>>>>>  private static final String FROM = "from";
>>>>>>>  private static final String SENT_DATE = "sentDate";
>>>>>>>  private static final String XMAILER = "xMailer";
>>>>>>>  // multi valued
>>>>>>>  private static final String TO_CC_BCC = "allTo";
>>>>>>>  private static final String FLAGS = "flags";
>>>>>>>  private static final String CONTENT = "content";
>>>>>>>  private static final String ATTACHMENT = "attachement";
>>>>>>>  private static final String ATTACHMENT_NAMES = "attachementNames";
>>>>>>>  // flag values
>>>>>>>  private static final String FLAG_ANSWERED = "answered";
>>>>>>>  private static final String FLAG_DELETED = "deleted";
>>>>>>>  private static final String FLAG_DRAFT = "draft";
>>>>>>>  private static final String FLAG_FLAGGED = "flagged";
>>>>>>>  private static final String FLAG_RECENT = "recent";
>>>>>>>  private static final String FLAG_SEEN = "seen";
>>>>>>> {code}
>>>>>>
>>>>>> --
>>>>>> This message is automatically generated by JIRA.
>>>>>> -
>>>>>> You can reply to this email to add a comment to the issue online.
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> --Noble Paul
>>>
>>>
>>
>>
>>
>> --
>> --Noble Paul
>>
>



-- 
--Noble Paul
