Re: [jira] Commented: (SOLR-934) Enable importing of mails into a solr index through DIH.

Ben Johnson Fri, 02 Jan 2009 02:13:00 -0800

Thanks Paul and Preetam.  A couple of further things:

- How do you envisage this functionality being used? I can see indexing all emails for all users as part of a one-off system setup/migration process, but also as a core feature to ensure all emails received by a company/organisation are indexed (and stored). This could be done either by the end-user, who controls what should be indexed (i.e. certain work-related emails only) or directly from the mail server, where all emails would be indexed (including personal emails, which could later be deleted from the index if desired) to ensure no important emails get missed. Is this the sort of thing you had in mind? There is also the issue of not indexing/storing the same email from multiple users' mailboxes (haven't worked that one out yet, possibly via a hash).

- Is the mailbox 'configuration' (<entity> tag) stored in data-config.xml on the Solr server? If so, this would seem to have quite a lot of administrative overhead - how do you manage a system with 5000+ users? How are the accounts/passwords maintained? Are the passwords stored in plain text?


- Minor typo: *conectTimeout* should be *connectTimeout*

- A few real-world scenarios I've encountered are:

   - be able to handle an email sent to over 5000 recipients (in the 'To:' field)

   - be able to handle an email with a 'long' subject line (240+ characters)

   - be able to handle an email with 100 attachments

   - be able to handle an email with attachments with 'long' names (240+ characters)

These caused several problems in the software I was using at the time (a proprietary system, not Solr-based): either memory-related issues, or file system errors when running on Windows, where the file system or its API limited file names to 255 characters, including the path.
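
On the de-duplication point in the first item above, here is a minimal sketch of the hash idea. It assumes the Message-ID header is a good enough identity for an email, which is an assumption and not something the SOLR-934 patch provides; MailDedupKey and keyFor are hypothetical names. The resulting key could serve as the Solr unique key, so that a second copy of the same email from another mailbox overwrites the first instead of adding a duplicate:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of DIH or the SOLR-934 patch.
final class MailDedupKey {

    // Derive a stable hex key from a Message-ID header value.
    static String keyFor(String messageId) {
        // Normalise: trim whitespace and drop the surrounding angle brackets.
        String normalised = messageId.trim();
        if (normalised.startsWith("<") && normalised.endsWith(">")) {
            normalised = normalised.substring(1, normalised.length() - 1);
        }
        try {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            byte[] digest = sha.digest(normalised.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }

    public static void main(String[] args) {
        // Two copies of the same email from different mailboxes give the same key.
        System.out.println(keyFor("<abc123@mail.example.org>"));
        System.out.println(keyFor("  <abc123@mail.example.org> "));
    }
}
```

Emails without a Message-ID (or with a forged one) would need a fallback, for example hashing the sender, sent date and subject together; that choice is left open here.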


Thanks very much!
Ben

--------------------------------------------------
From: "Noble Paul നോബിള്‍ नोब्ळ्" <[email protected]>
Sent: Friday, January 02, 2009 5:02 AM
To: <[email protected]>

Subject: Re: [jira] Commented: (SOLR-934) Enable importing of mails into a solr index through DIH.

Hi Ben,
You can take a look at the wiki page for DIH
http://wiki.apache.org/solr/DataImportHandler

It helps you index mostly structured data into Solr from DB, XML, etc.
It can be considered an ETL tool
(http://en.wikipedia.org/wiki/Extract,_transform,_load ) for Solr.

Adding mail support means you can index your emails into Solr with a
few lines of configuration.
--Noble

On Fri, Jan 2, 2009 at 4:30 AM, Ben Johnson
<[email protected]> wrote:

I'm watching this issue with interest, but I'm having trouble understanding
the bigger picture.  I am prototyping a system that uses Restlet to store
and index objects (mainly MS Office and OpenOffice documents and emails), so
I am planning to use Solr with Tika to index the objects.

I know nothing about DIH (Distributed Index Handler?), so I'm not sure what
role it plays with Solr.  Is it a vendor-specific technology (from
Autonomy)? What does it do? Do you give it objects to index and it handles
them by passing it to one or more Solr/Tika indexing servers? And are you
thinking that this would therefore be a good place to not only index the
objects, but also pass the information about the digital content to DROID?

Reading a bit about DROID (from TNA, The National Archives), it seems like
it is used to capture information about the digital content of objects
stored in a content repository.  How does this fit with Solr?  I thought
Solr with Tika just did the indexing of text-based objects, but the actual
storage of the objects would be elsewhere (probably in the file system).
From what I can tell, DROID would operate on the file system objects, not
the indexing information.  Have I got this right?

Ideally, I would also like to convert any suitable content into PDF/A
format for long-term archival - probably not relevant to this issue, but I
thought I'd mention it in case you see an application of this as part of
email and attachment storage.

Sorry for all the questions, but hopefully someone could clarify this for
me!

Thanks very much
Ben Johnson

--------------------------------------------------
From: "Grant Ingersoll (JIRA)" <[email protected]>
Sent: Thursday, January 01, 2009 7:07 PM
To: <[email protected]>

Subject: [jira] Commented: (SOLR-934) Enable importing of mails into a solr index through DIH.


  [
https://issues.apache.org/jira/browse/SOLR-934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660210#action_12660210
]

Grant Ingersoll commented on SOLR-934:
--------------------------------------

Would it make more sense for DIH to farm out its content acquisition to a library like Droids? Then we could have real crawling, etc., all through a pluggable connector framework.

Enable importing of mails into a solr index through DIH.
--------------------------------------------------------

               Key: SOLR-934
               URL: https://issues.apache.org/jira/browse/SOLR-934
           Project: Solr
        Issue Type: New Feature
        Components: contrib - DataImportHandler
  Affects Versions: 1.4
          Reporter: Preetam Rao
          Assignee: Shalin Shekhar Mangar
           Fix For: 1.4

       Attachments: SOLR-934.patch, SOLR-934.patch

 Original Estimate: 24h
 Remaining Estimate: 24h

Enable importing of mails into solr through DIH. Take one or more mailbox credentials, download and index their content along with the content from attachments. The folders to fetch can be made configurable based on various criteria. Apache Tika is used for extracting content from different kinds of attachments. JavaMail is used for mailbox-related operations like fetching mails, filtering them, etc.
The basic configuration for one mailbox is as below:
{code:xml}
<document>
  <entity processor="MailEntityProcessor" user="[email protected]"
          password="something" host="imap.gmail.com"
          protocol="imaps"/>
</document>
{code}
Below is the list of all available configuration options:
{color:green}Required{color}
---------
*user*
*pwd*
*protocol*  (only "imaps" supported now)
*host*
{color:green}Optional{color}
---------
*folders* - comma separated list of folders.
If not specified, the default folder is used. Nested folders can be specified
like a/b/c
*recurse* - index subfolders. Defaults to true.
*exclude* - comma separated list of patterns.
*include* - comma separated list of patterns.
*batchSize* - number of mails to fetch at once in a given folder.
Only headers can be prefetched in JavaMail IMAP.
*readTimeout* - defaults to 60000ms
*conectTimeout* - defaults to 30000ms
*fetchSize* - IMAP config. 32KB default
*fetchMailsSince* - date/time in milliseconds; mails received after this time will be fetched.
Useful for delta import.
*customFilter* - class name.
{code}
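import javax.mail.Flags;
import javax.mail.Folder;
import javax.mail.search.FlagTerm;
import javax.mail.search.SearchTerm;

// clz is the class named by customFilter; it implements the filter interface.
public class clz implements MailEntityProcessor.CustomFilter {
  public SearchTerm getCustomSearch(Folder folder) {
    // Illustrative body only: match mails that are not yet marked as seen.
    return new FlagTerm(new Flags(Flags.Flag.SEEN), false);
  }
}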
{code}
*processAttachement* - defaults to true
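
For the *fetchMailsSince* option above, here is a small sketch of how a value could be computed. It assumes the value is epoch milliseconds, as this description states; FetchMailsSince, millisFor and millisSince are hypothetical names and not part of the patch:

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper for producing a fetchMailsSince value.
final class FetchMailsSince {

    // Epoch milliseconds for an ISO-8601 UTC timestamp such as "2009-01-01T00:00:00Z".
    static long millisFor(String isoInstant) {
        return Instant.parse(isoInstant).toEpochMilli();
    }

    // Epoch milliseconds for "now minus window", e.g. for a periodic delta import.
    static long millisSince(Instant now, Duration window) {
        return now.minus(window).toEpochMilli();
    }

    public static void main(String[] args) {
        // Fixed cut-off: everything received since the start of 2009 (UTC).
        System.out.println("fetchMailsSince=\"" + millisFor("2009-01-01T00:00:00Z") + "\"");
        // Rolling cut-off: everything received in the last 24 hours.
        System.out.println("fetchMailsSince=\"" + millisSince(Instant.now(), Duration.ofHours(24)) + "\"");
    }
}
```

A scheduled delta import would typically store the time of the last successful run and pass that as the cut-off, rather than using a fixed window.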
Below are the indexed fields.
{code}
 // Fields To Index
 // single valued
 private static final String SUBJECT = "subject";
 private static final String FROM = "from";
 private static final String SENT_DATE = "sentDate";
 private static final String XMAILER = "xMailer";
 // multi valued
 private static final String TO_CC_BCC = "allTo";
 private static final String FLAGS = "flags";
 private static final String CONTENT = "content";
 private static final String ATTACHMENT = "attachement";
 private static final String ATTACHMENT_NAMES = "attachementNames";
 // flag values
 private static final String FLAG_ANSWERED = "answered";
 private static final String FLAG_DELETED = "deleted";
 private static final String FLAG_DRAFT = "draft";
 private static final String FLAG_FLAGGED = "flagged";
 private static final String FLAG_RECENT = "recent";
 private static final String FLAG_SEEN = "seen";
{code}


--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

--

--Noble Paul
