Any experience with spring's lucene support?

2006-11-01 Thread lude

Hello,

while starting a new project we are thinking about using the
spring-modules for working with lucene. See:
https://springmodules.dev.java.net/

Does anybody have experience with this higher-level Lucene API?
How does it compare to Compass?
(Dis-)advantages of using the spring-modules Lucene support?

Thanks
lude


Re: Any experience with spring's lucene support?

2006-11-02 Thread lude

Is nobody here using spring-modules?




Index-Format difference between 1.4.3 and 2.0

2006-07-18 Thread lude

Hello,

sorry, didn't find the information elsewhere:

1.) Did the format of the Lucene index change between versions 1.4.3 and 2.0?
2.) Is it possible to use the old Luke tool with a new Lucene 2 index?

Thanks
lude


Re: Index-Format difference between 1.4.3 and 2.0

2006-07-19 Thread lude

Hi Nicolas,

thanks for answering.

You wrote:

And about Luke, AFAIK too, is a Lucene-2 app, so it will be able to read a
1.4 index.

What do you mean?
The Luke website states: "Current version is 0.6. It has been released on
16 Feb 2005."
How can Luke be a Lucene-2 application if it was released in Feb 2005?

lude


On 7/18/06, Nicolas Lalevée <[EMAIL PROTECTED]> wrote:



AFAIK, the format changed, but in a compatible way: a Lucene-2 app will be
able to read a 1.4 index. The index format changes are documented here:
http://lucene.apache.org/java/docs/fileformats.html

And about Luke, AFAIK too, is a Lucene-2 app, so it will be able to read a
1.4 index.

Nicolas

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: Index-Format difference between 1.4.3 and 2.0

2006-07-20 Thread lude

As Luke was released with a Lucene-1.9 ...


Where did you get this information? As far as I know, Luke is based on Lucene
version 1.4.3.


On 7/19/06, Nicolas Lalevée <[EMAIL PROTECTED]> wrote:



Yes, I should not have used the term "Lucene-2 application", sorry.
BTW, regarding the index format described here:
http://lucene.apache.org/java/docs/fileformats.html
the only distinction that matters is whether Lucene >= 1.9 or Lucene <= 1.4.
As Luke was released with a Lucene-1.9 snapshot, it will work with a Lucene
2.0 index, because a 1.9 index and a 2.0 index are the same.

Nicolas.






Best Practice: emails and file-attachments

2006-08-15 Thread lude

Hello,

does anybody have an idea what the best design approach is for realizing
the following:

The goal is to index emails and their corresponding file attachments.
One email could contain for example:

1 x subject
1 x sender-address
1 x to-addresses
1 x message-text
0..n x file-attachments  (each contains a 'file-name' and the
'file-content')

How should I build the index?

First approach:
Each email + attachments gets one document with the following fields:
subject, sender_address, to_address, message_text, 1_attachment_name,
1_attachment_content, 2_attachment_name, 2_attachment_content,
3_attachment_name, 3_attachment_content
Disadvantage:
Only three attachments can be indexed. It isn't a generic solution for
indexing 'n' file-attachments.

Second approach:
Each email gets one document with the main email-data and 0 to n documents
of file-attachments:
1 x  email_id, subject, sender_address, to_address, message_text
0..n x  email_id, attachment_name, attachment_content
Disadvantage:
At query time it is difficult to aggregate the documents that belong
together. One hit per email (including attachments) should be shown.
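If you go with the second approach, the aggregation can happen in application
code after the search: group the hits by their email_id field and keep one
entry per email. A minimal sketch in plain Java (the Hit class and the field
names are illustrative assumptions, not Lucene API):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class HitAggregator {

    // Hypothetical search hit: the stored email_id field plus the score.
    public static class Hit {
        public final String emailId;
        public final float score;
        public Hit(String emailId, float score) {
            this.emailId = emailId;
            this.score = score;
        }
    }

    // Collapse hits so each email appears once, keeping its best-scoring hit.
    public static List<Hit> onePerEmail(List<Hit> hits) {
        Map<String, Hit> best = new LinkedHashMap<String, Hit>();
        for (Hit h : hits) {
            Hit prev = best.get(h.emailId);
            if (prev == null || h.score > prev.score) {
                best.put(h.emailId, h);
            }
        }
        return new ArrayList<Hit>(best.values());
    }

    public static void main(String[] args) {
        List<Hit> hits = new ArrayList<Hit>();
        hits.add(new Hit("mail-1", 0.9f)); // match in the message text
        hits.add(new Hit("mail-2", 0.4f)); // match in an attachment document
        hits.add(new Hit("mail-2", 0.7f)); // match in the email document
        System.out.println(onePerEmail(hits).size()); // prints 2
    }
}
```

All the sketch assumes is a stored email_id on both the email document and
each attachment document.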

Any thoughts?

Thanks
lude


Re: Best Practice: emails and file-attachments

2006-08-16 Thread lude

Hi John,

thanks for the detailed answer.

You wrote:

If you're indexing a
multipart/alternative bodypart then index all the MIME headers, but only
index the content of the *first* bodypart.


Does this mean you index just the first file-attachment?
What do you advise if you have to index multipart bodies (== more than one
file-attachment)?
One Lucene document for each part (== file)?
How do you handle the queries?

Greetings
lude



On 8/15/06, John Haxby <[EMAIL PROTECTED]> wrote:


lude wrote:
> does anybody have an idea what the best design approach is for realizing
> the following:
>
> The goal is to index emails and their corresponding file attachments.
> One email could contain for example:
I put a fair amount of thought into this when I was doing the design for
our mail server -- I know about mail :-)   After a little trial and
error I came up with the following scheme:

  1. All header fields indexed under their own name with the name
 converted to lower case.
  2. Almost all bodyparts indexed in a single field called BODY (in
 upper case)
  3. Meta-data such as SIZE, DELIVERY-DATE and similar indexed with
 uppercase fields
  4. Extensions for other bodypart-specific or application-specific
 fields indexed as something with an initial uppercase letter and
 at least one lowercase letter

That gives an extensible set of fields and does not require that the index
know ahead of time what header fields will be present or relevant.   It
means that there are potentially a lot of fields: we're running at about
60 depending on the user.

Some header fields are special.   The various message-id fields
(Message-Id, Resent-Message-Id, In-Reply-To and References) need to have
their message-ids carefully extracted and then indexed untokenized.
Recipient fields (to, cc, from, etc) need to be parsed and then have their
addresses re-assembled as a friendly-name and an RFC822 address -- the
reason for the re-assembly is that addresses can be presented in
equivalent but odd fashions.   Most header fields can have RFC2047
encoded text which needs to be decoded.
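To illustrate the RFC2047 point: an encoded-word such as =?UTF-8?B?...?= must
be decoded before indexing, otherwise the raw Base64 ends up in the index. A
rough sketch for the B (Base64) encoding only -- real code must also handle
the Q encoding and adjacent encoded-words, which JavaMail's
MimeUtility.decodeText already does:

```java
import java.nio.charset.Charset;
import java.util.Base64;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Rfc2047 {

    // Matches a single RFC2047 encoded-word using the B (Base64) encoding.
    private static final Pattern B_WORD =
            Pattern.compile("=\\?([^?]+)\\?[Bb]\\?([^?]*)\\?=");

    // Replace every B-encoded word in a header value with its decoded text.
    public static String decodeB(String header) {
        Matcher m = B_WORD.matcher(header);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            byte[] raw = Base64.getDecoder().decode(m.group(2));
            String text = new String(raw, Charset.forName(m.group(1)));
            m.appendReplacement(out, Matcher.quoteReplacement(text));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // "SGVsbG8gV29ybGQ=" is Base64 for "Hello World".
        System.out.println(decodeB("Subject: =?UTF-8?B?SGVsbG8gV29ybGQ=?="));
    }
}
```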

When indexing the bodyparts you need to be a little careful.   In
general, the MIME headers for each part are all indexed as other message
headers (content-id is a message-id field) and I also indexed the
canonical content type under a CONTENT-TYPE field, again to get rid of
fluff so that I can search for, say,
CONTENT-TYPE:application/x-vnd-powerpoint to find all those annoyingly
huge messages :-)  An attached message probably doesn't want all its
headers indexed: subject is good; recipients are probably bad as it'll
confuse the normal search and give unexpected results; message-id fields
are almost certainly a bad idea.  If you're indexing a
multipart/alternative bodypart then index all the MIME headers, but only
index the content of the *first* bodypart.

Does that all make sense?  Javamail is great for this, it's good at
parsing and extracting the content of messages.  However, it's not
enough to just read what I've said and the Javamail doc.   You need to be
intimately familiar with the MIME RFCs (I think the first one is
RFC2045, but they're not difficult to find as they're all around RFC2047)
and with RFC2822, the message structure RFC itself.   If you just guess
because the structure is "obvious" you'll come unstuck.

jch





Re: Best Practice: emails and file-attachments

2006-08-16 Thread lude

Hi Dejan,

how do you query for email and(!) attachment documents,
if you just want to present one hit per email (even if the search term
matches in both the email and the corresponding attachment document)?

Thanks
lude


On 8/15/06, Dejan Nenov <[EMAIL PROTECTED]> wrote:


The approach I find best is to create both email documents - each containing
a list of (and links to) all attachments - as well as individual attachment
documents.

It gets a little tricky when you have a forwarded email, containing an
original Email that contains a tar.gz attachment, which contains the
"actual" attached files :)

(Shameless promotion follows) If you are a Windows user, for a _very_ good
example get a copy of X1 Desktop (free - also distributed as Yahoo!
Desktop
search) - then right-click on the column headers and look at the available
fields for email.


Dejan







Re: Best Practice: emails and file-attachments

2006-08-16 Thread lude

Hi John,

thanks again for the many words and explanations!


You also mentioned indexing each bodypart ("attachment") separately.
Why? To my mind, there is no use case where it makes sense to search a
particular bodypart.

I will give you the use case:

1.) User searches for "abcd"
2.) Lucene matches the search term (at least) two times:
    - One email has the term in the plain message text (mail-1)
    - One email contains five file-attachments. One of these files matches
      the search term (mail-2)
3.) The result list would show this:
    1. mail-1 'subject'
       'Abstract of the message-text'
    2. mail-2 'subject'
       Attachment with name 'filename.doc' contains 'Abstract of
       file-content'

Another use case would be an extended search, which lets the user select
whether "attached files" should be searched (yes or no).

Greetings
lude









Singleton and IndexModifier

2006-08-20 Thread lude

Hello,

when using the new IndexModifier of Lucene 2.0, what would be
the best creation-pattern?

Should there be one IndexModifier instance in the application (== singleton)?
Could an IndexModifier be opened for a longer time or should it be created
on use and immediately closed?

Another issue:
- I create an IndexModifier
- The application crashes
- A write lock remains on the index
--> The next time I start the application, the IndexModifier can't be opened
because of the lock.

What is the right way to check and delete old write locks?

Thanks
lude


Re: Singleton and IndexModifier

2006-08-21 Thread lude

Thanks Simon.

In practice my application would have around 100 queries and around 10
add/deletes per minute.
Add/deletes should show up immediately.
That means that I should always create and close an IndexModifier (and
IndexReader for Searching) for each operation, right?

Sure, it costs a little performance. But it ensures that:
1.)  the change is visible immediately
2.)  the write.lock (and commit.lock) doesn't remain when the application
shuts down or crashes


On 8/20/06, Simon Willnauer <[EMAIL PROTECTED]> wrote:


On 8/20/06, lude <[EMAIL PROTECTED]> wrote:
> Hello,
>
> when using the new IndexModifier of Lucene 2.0, what would be
> the best creation-pattern?
>
> Should there be one IndexModifier instance in the application
(==singelton)?
> Could an IndexModifier be opened for a longer time or should it be
created
> on use and immediately closed?
You create an IndexModifier if you want to modify your index. If you
want to commit your data (make it available via an index reader / searcher),
you close or rather flush your IndexModifier. After closing the
modifier you can create a new searcher and all modifications are
visible to the searcher / reader. Basically there is one index
modifier (or one single modifying instance per index, as you must not
modify your index with more than one instance of IndexReader / IndexWriter /
IndexModifier). If you use the flush() method you don't need to create a
new IndexModifier instance.

You can leave your IndexModifier open for a long time, but without a commit
the indexed or deleted documents won't be visible to your searcher /
reader. Opening an IndexModifier is a heavy operation which should be
done carefully, so you can close your modifier every n docs or after a
certain idle time.
Just be aware that IndexModifier uses IndexReader and IndexWriter
internally, so if you do a delete after an addDocument, the IndexWriter
will be closed and a new IndexReader will be opened. This is also a
heavy operation in terms of performance.
You could hold back your deletes until you want to flush your instance of
IndexModifier to gain a bit of performance.
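In code, the cycle described above might look roughly like this (a sketch
written against the Lucene 2.0 API from memory, with a hypothetical index
path; not compiled):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexModifier;
import org.apache.lucene.index.Term;

// One long-lived modifier; flush() commits without re-creating the instance.
IndexModifier modifier = new IndexModifier("/path/to/index",
        new StandardAnalyzer(), false);  // false = append to existing index

Document doc = new Document();
doc.add(new Field("id", "42", Field.Store.YES, Field.Index.UN_TOKENIZED));
doc.add(new Field("body", "some text", Field.Store.NO, Field.Index.TOKENIZED));
modifier.addDocument(doc);

// Batch deletes together to avoid extra internal writer/reader switches.
modifier.deleteDocuments(new Term("id", "41"));

modifier.flush();  // commit: a newly opened searcher now sees the changes
modifier.close();  // on shutdown
```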
>
> Another issue:
> - I create an IndexModifier
> - The applicaton crashes
> - There exists a write-lock on the index
> --> Next time I start the application the IndexModifier couldn't be
opened
> because of the locks.
>
> What is the right way to check and delete old write locks?
You can use the IndexReader.unlock(Directory dir) method; the javadoc
says:

  * Caution: this should only be used by failure recovery code,
  * when it is known that no other process nor thread is in fact
  * currently accessing this index.

To make sure this happens only in recovery mode.
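Conceptually, unlocking boils down to removing the stale lock file, as in the
following plain-java.io sketch (the lock file name and location here are
hypothetical - Lucene computes them itself, which is why the unlock() call is
the safer route):

```java
import java.io.File;

public class LockCleanup {

    // Remove a stale write lock left behind by a crashed process.
    // Only safe when we know no other process or thread is using the
    // index - the same caveat the IndexReader.unlock javadoc gives.
    public static boolean removeStaleLock(File indexDir) {
        File lock = new File(indexDir, "write.lock");  // hypothetical location
        return lock.exists() && lock.delete();
    }
}
```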

best regards Simon
>
> Thanks
> lude
>
>





Re: Singleton and IndexModifier

2006-08-21 Thread lude

Ok, you've got me! ;)

How do you ensure that your IndexModifier (or IndexWriter/IndexReader)
is closed when your application ends?

Or do you always call IndexReader.unlock(Directory dir) at startup time of
your application?

Thanks!
lude

On 8/21/06, Simon Willnauer <[EMAIL PROTECTED]> wrote:


In the GData server I use a timed indexer which commits the modifications
after a certain idle time or after n document inserts/updates/deletes.
This ensures that your modifications will be available after a defined
time. It also minimizes opening and closing readers and writers, as the
deletes will be done after all inserts. I really do recommend keeping
your searcher open as long as possible and keeping IndexModifier
open/close actions to a minimum.

best regards Simon

On 8/21/06, Erick Erickson <[EMAIL PROTECTED]> wrote:
> my only caution is that as your index grows, the close/open of your
readers
> may take more time than you are willing to spend. Not that I'm
recommending
> against it as I don't know the details, but it's something to keep an
eye
> on. In my experience, "immediately available" may really mean "available
> after 10 minutes", so you can think about allowing a little latency to
> improve query speed if that becomes an issue.
>
> Best
> Erick





Re: Index-Format difference between 1.4.3 and 2.0

2006-08-25 Thread lude

Hi Andrzej,

a month ago you mentioned a new Lucene 2.0-compatible version of Luke.
Does it exist somewhere?

Thanks
lude


On 7/20/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:


lude wrote:
>> As Luke was release with a Lucene-1.9 
>
> Where did you get this information? From all I know Luke is based on
> Lucene
> Version 1.4.3.
>

The latest version of Luke was released with an early snapshot of 1.9. I
plan to release a 2.0-based version in a few days.

--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com







QueryParser returns all documents

2006-09-04 Thread lude

Hello,

a simple, stupid question to the friendly mailinglist:

How do you have to define the query string to get all documents of an index
returned when using the QueryParser?
In theory a query like 'NOT word_not_in_index' should find and return all
documents. In practice this doesn't work (no documents are found).

Greetings
lude


Re: QueryParser returns all documents

2006-09-05 Thread lude

Why would you want to do this?


This is a 'feature request' for our search engine.
The user should have the possibility to query for all(!) documents.
This would allow him to see all available documents listed.

Is there a simple way to define a query that returns all documents of an
index?

Thanks lude

On 9/4/06, karl wettin <[EMAIL PROTECTED]> wrote:


On Mon, 2006-09-04 at 22:32 +0200, lude wrote:

> How do you have to define the query-string to get all documents of an
index
> be returned by using the QueryParser?
> In theory a query like 'NOT word_not_in_index' should find and return
all
> documents. In practice this doesn't work (no documents are found).

Why would you want to do this?

Use IndexReader methods to iterate all documents.
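Sketched against the Lucene 2.0 IndexReader API (hypothetical index path; not
compiled), iterating all documents looks roughly like:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;

IndexReader reader = IndexReader.open("/path/to/index");
try {
    for (int i = 0; i < reader.maxDoc(); i++) {
        if (reader.isDeleted(i)) continue;  // skip deleted slots
        Document doc = reader.document(i);
        // ... render doc as one row of the "all documents" listing
    }
} finally {
    reader.close();
}
```

If I remember correctly, newer Lucene versions (around 1.9/2.0) also ship a
MatchAllDocsQuery that does the same through the normal search path; check
whether your version includes it.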






Status: Sorting on tokenized fields

2006-09-22 Thread lude

Hello,

for years there has been discussion about making Lucene able to sort on
TOKENIZED fields
(e.g. if more than one term is available, concatenate the tokens OR use the
stored value for sorting).

Could one of the developers please state whether there are any
plans to implement this feature in the future?

Thanks
lude


Re: Status: Sorting on tokenized fields

2006-09-25 Thread lude

Hi Chris,

sure, you can create an additional field for every field that should
support sorting.

In our application we would need to do this for all 20 fields. That means
we would have to create twenty redundant fields just for sorting.
That's a real overhead in size and indexing time.

:: using the stored value doesn't help: there can be multiple stored values
:: just as easily as there can be multiple tokens.

In what situations do you have more than one stored value per field?
In our application we always have one stored value for each field.
In such a situation it would be perfect to take these stored values for
sorting, wouldn't it?

Greetings
lude



On 9/23/06, Chris Hostetter <[EMAIL PROTECTED]> wrote:



: for years there is the discussion to make lucene able to sort on
TOKENIZED
: fields.

really? .. i've only been on the list since 1.4.3 but i don't remember it
being much of a recurring topic.

: (e.g. if more then one term is available concatenate the tokens OR use
the
: stored value for sorting).

using the stored value doesn't help: there can be multiple stored values
just as easily as there can be multiple tokens.

concatenating the tokens is a vague concept that would be very hard to get
right in a way that would work generically: for starters, how do you deal
with tokens at the same position? (ie: synonyms)

In my experience, the best way to deal with this is for the application
using Lucene to decide which fields it wants to sort on, and make a
"sortable" version of that field that is indexed but not tokenized -- the
application is after all in the best position to decide how exactly it
wants to "sort" on the data (ie: should the values be lowercased so the
sort is case-insensitive? should certain punctuation characters be stripped
out? etc...)
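The normalization Hoss describes can be a small function applied at indexing
time to build the sortable field; the exact rules (lowercasing, which
punctuation to strip) are the application's choice. A sketch:

```java
import java.util.Locale;

public class SortKeys {

    // Build an untokenized, case-insensitive sort key from a field value.
    public static String sortKey(String fieldValue) {
        return fieldValue
                .toLowerCase(Locale.ENGLISH)      // case-insensitive ordering
                .replaceAll("[\\p{Punct}]", "")   // strip punctuation
                .replaceAll("\\s+", " ")          // collapse whitespace runs
                .trim();
    }

    public static void main(String[] args) {
        // The key would go into a parallel field such as "title_sort",
        // indexed with the Lucene 2.0 flag Field.Index.UN_TOKENIZED.
        System.out.println(sortKey("  The Quick, Brown Fox! "));
    }
}
```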



-Hoss

