Any experience with spring's lucene support?
Hello, while starting a new project we are thinking about using spring-modules for working with Lucene. See: https://springmodules.dev.java.net/ Does anybody have experience with this higher-level Lucene API? How does it compare to Compass? What are the (dis-)advantages of using the spring-modules Lucene support? Thanks lude
Re: Any experience with spring's lucene support?
Nobody here who is using spring-modules? On 11/1/06, lude <[EMAIL PROTECTED]> wrote: Hello, while starting a new project we are thinking about using spring-modules for working with Lucene. See: https://springmodules.dev.java.net/ Does anybody have experience with this higher-level Lucene API? How does it compare to Compass? What are the (dis-)advantages of using the spring-modules Lucene support? Thanks lude
Index-Format difference between 1.4.3 and 2.0
Hello, sorry, I didn't find the information elsewhere: 1.) Did the format of the Lucene index change between versions 1.4.3 and 2.0? 2.) Is it possible to use the old Luke tool with a new Lucene 2 index? Thanks lude
Re: Index-Format difference between 1.4.3 and 2.0
Hi Nicolas, thanks for answering. You wrote: "And about Luke, AFAIK too, it is a Lucene-2 app, so it will be able to read a 1.4 index" What do you mean? The Luke website states: "Current version is 0.6. It has been released on 16 Feb 2005." How can Luke be a Lucene-2 application if it was released in Feb 2005? lude On 7/18/06, Nicolas Lalevée <[EMAIL PROTECTED]> wrote: AFAIK, the format changed, but in a compatible way: a Lucene-2 app will be able to read a 1.4 index. The index format changes are documented here: http://lucene.apache.org/java/docs/fileformats.html And about Luke, AFAIK too, it is a Lucene-2 app, so it will be able to read a 1.4 index. Nicolas - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Index-Format difference between 1.4.3 and 2.0
"As Luke was released with a Lucene-1.9" Where did you get this information? From all I know, Luke is based on Lucene 1.4.3. On 7/19/06, Nicolas Lalevée <[EMAIL PROTECTED]> wrote: Yes, I should not have used the words "Lucene-2 application", sorry. BTW, about the index format described here: http://lucene.apache.org/java/docs/fileformats.html the only distinction that matters is whether Lucene >= 1.9 or Lucene <= 1.4. As Luke was released with a Lucene 1.9, it will work with a Lucene 2.0 index because a 1.9 index and a 2.0 index are the same. Nicolas.
Best Practice: emails and file-attachments
Hello, does anybody have an idea about the best design approach for the following: The goal is to index emails and their corresponding file attachments. One email could contain, for example: 1 x subject, 1 x sender-address, 1 x to-addresses, 1 x message-text, 0..n x file-attachments (each with a 'file-name' and the 'file-content'). How should I build the index? First approach: Each email plus its attachments gets one document with the following fields: subject, sender_address, to_address, message_text, 1_attachment_name, 1_attachment_content, 2_attachment_name, 2_attachment_content, 3_attachment_name, 3_attachment_content. Disadvantage: only three attachments can be indexed; it isn't a generic solution for indexing 'n' file attachments. Second approach: Each email gets one document with the main email data, plus 0 to n documents for the file attachments: 1 x email_id, subject, sender_address, to_address, message_text; 0..n x email_id, attachment_name, attachment_content. Disadvantage: at query time it is difficult to aggregate the documents that belong to each other; one hit per email (including attachments) should be shown. Any thoughts? Thanks lude
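The aggregation problem of the second approach can be solved after the search: every document (email or attachment) carries the shared email_id field, and the hit list is collapsed on that value. A minimal sketch in plain Java (the Hit class and the collapse method are hypothetical names, not Lucene API; on the Lucene side, email_id would simply be stored untokenized on every document):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class HitCollapser {

    /** A search hit as read back from the index: the email_id field
     *  value plus a short display snippet for the matching part. */
    public static class Hit {
        final String emailId;
        final String snippet;
        public Hit(String emailId, String snippet) {
            this.emailId = emailId;
            this.snippet = snippet;
        }
    }

    /** Group hits by email_id, preserving the original (score) order,
     *  so the result list shows each email once, together with the
     *  snippets of every matching part (message text or attachment). */
    public static Map<String, List<String>> collapse(List<Hit> hits) {
        Map<String, List<String>> byEmail =
                new LinkedHashMap<String, List<String>>();
        for (Hit h : hits) {
            List<String> snippets = byEmail.get(h.emailId);
            if (snippets == null) {
                snippets = new ArrayList<String>();
                byEmail.put(h.emailId, snippets);
            }
            snippets.add(h.snippet);
        }
        return byEmail;
    }
}
```

Because a LinkedHashMap keeps insertion order, the collapsed list stays ranked by the best-scoring part of each email.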
Re: Best Practice: emails and file-attachments
Hi John, thanks for the detailed answer. You wrote: "If you're indexing a multipart/alternative bodypart then index all the MIME headers, but only index the content of the *first* bodypart." Does this mean you index just the first file attachment? What do you advise if you have to index multipart bodies (i.e. more than one file attachment)? One Lucene document for each part (i.e. each file)? How do you handle the queries? Greetings lude On 8/15/06, John Haxby <[EMAIL PROTECTED]> wrote: lude wrote: > does anybody have an idea about the best design approach for the following: > The goal is to index emails and their corresponding file attachments. I put a fair amount of thought into this when I was doing the design for our mail server -- I know about mail :-) After a little trial and error I came up with the following scheme: 1. All header fields indexed under their own name, with the name converted to lower case. 2. Almost all bodyparts indexed in a single field called BODY (in upper case). 3. Meta-data such as SIZE, DELIVERY-DATE and similar indexed under uppercase fields. 4. Extensions for other bodypart-specific or application-specific fields indexed under names with an initial uppercase letter and at least one lowercase letter. That gives an extensible set of fields and does not require that the index know ahead of time which header fields will be present or relevant. It means that there are potentially a lot of fields: we're running at about 60, depending on the user. Some header fields are special. The various message-id fields (Message-Id, Resent-Message-Id, In-Reply-To and References) need to have their message-ids carefully extracted and then indexed untokenized. Recipient fields (to, cc, from, etc.) need to be parsed and then have their addresses re-assembled as a friendly name and an RFC 822 address -- the reason for the re-assembly is that addresses can be presented in equivalent but odd fashions.
Most header fields can contain RFC 2047 encoded text which needs to be decoded. When indexing the bodyparts you need to be a little careful. In general, the MIME headers for each part are all indexed like other message headers (content-id is a message-id field), and I also indexed the canonical content type under a CONTENT-TYPE field, again to get rid of fluff so that I can search for, say, CONTENT-TYPE:application/x-vnd-powerpoint to find all those annoyingly huge messages :-) An attached message probably doesn't want all its headers indexed: subject is good; recipients are probably bad as they'll confuse the normal search and give unexpected results; message-id fields are almost certainly a bad idea. If you're indexing a multipart/alternative bodypart then index all the MIME headers, but only index the content of the *first* bodypart. Does that all make sense? JavaMail is great for this; it's good at parsing and extracting the content of messages. However, it's not enough to just read what I've said and the JavaMail docs: you need to be intimately familiar with the MIME RFCs (I think the first one is RFC 2045, but they're not difficult to find as they're all around RFC 2047) and with RFC 2822, the message structure RFC itself. If you just guess because the structure is "obvious" you'll come unstuck. jch
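A hypothetical JavaMail walk matching John's scheme -- recursing into bodyparts, taking only the first alternative of a multipart/alternative, and collecting text for the single BODY field. This is a sketch, not his implementation: it assumes JavaMail on the classpath, and the class/method names are made up.

```java
import javax.mail.Multipart;
import javax.mail.Part;

public class BodyWalker {

    /** Recursively collect indexable text from a message part into
     *  the buffer that will become the BODY field. */
    public static void walk(Part part, StringBuffer body) throws Exception {
        Object content = part.getContent();
        if (content instanceof Multipart) {
            Multipart mp = (Multipart) content;
            if (part.isMimeType("multipart/alternative")) {
                // per the advice above: index only the first alternative
                walk(mp.getBodyPart(0), body);
            } else {
                for (int i = 0; i < mp.getCount(); i++) {
                    walk(mp.getBodyPart(i), body);
                }
            }
        } else if (part.isMimeType("text/*")) {
            body.append(content.toString()).append('\n');
        }
        // non-text parts (PDF, Word, ...) would go through a
        // format-specific text extractor before being appended
    }
}
```

The MIME headers of each part would be indexed separately in the same pass, as described in the scheme above.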
Re: Best Practice: emails and file-attachments
Hi Dejan, how do you query for email and attachment documents if you just want to present one hit per email (even if the search term matches in both the email document and the corresponding attachment document)? Thanks lude On 8/15/06, Dejan Nenov <[EMAIL PROTECTED]> wrote: The approach I find best is to create both: email documents -- each containing a list of (and links to) all attachments -- as well as individual attachment documents. It gets a little tricky when you have a forwarded email containing an original email that contains a tar.gz attachment, which contains the "actual" attached files :) (Shameless promotion follows) If you are a Windows user, get a copy of X1 Desktop for a _very_ good example (free -- also distributed as Yahoo! Desktop Search), then right-click on the column headers and look at the available fields for email. Dejan
Re: Best Practice: emails and file-attachments
Hi John, thanks again for the many words and explanations! You also mentioned indexing each bodypart ("attachment") separately. Why? "To my mind, there is no use case where it makes sense to search a particular bodypart" I will give you the use case: 1.) The user searches for "abcd". 2.) Lucene matches the search term (at least) twice: one email has the term in the plain message text (mail-1); another email contains five file attachments, one of which matches the search term (mail-2). 3.) The result list would show: 1. mail-1 'subject' 'abstract of the message text' 2. mail-2 'subject' attachment named 'filename.doc' contains 'abstract of the file content' Another use case would be an extended search which lets the user choose whether attached files should be searched (yes or no). Greetings lude
Singleton and IndexModifier
Hello, when using the new IndexModifier of Lucene 2.0, what would be the best creation pattern? Should there be one IndexModifier instance in the application (i.e. a singleton)? Can an IndexModifier be kept open for a longer time, or should it be created on use and immediately closed? Another issue: I create an IndexModifier, the application crashes, and a write lock remains on the index --> the next time I start the application, the IndexModifier can't be opened because of the lock. What is the right way to check for and delete old write locks? Thanks lude
Re: Singleton and IndexModifier
Thanks Simon. In practice my application would have around 100 queries and around 10 adds/deletes per minute, and adds/deletes should show up immediately. That means I should always create and close an IndexModifier (and an IndexReader for searching) for each operation, right? Sure, it costs a little performance, but it ensures that 1.) the change is visible immediately and 2.) the write.lock (and commit.lock) doesn't remain when the application shuts down or crashes. On 8/20/06, Simon Willnauer <[EMAIL PROTECTED]> wrote: You create an IndexModifier when you want to modify your index. If you want to commit your data (make it available via an IndexReader/searcher) you close, or rather flush, your IndexModifier. After closing the modifier you can create a new searcher and all modifications are visible to the searcher/reader. Basically there is one index modifier (or one single modifying instance per index, as you must not modify your index with more than one instance of IndexReader/IndexWriter/IndexModifier). If you use the flush() method you don't need to create a new IndexModifier instance. You can leave your IndexModifier open for a long time, but without a commit the indexed or deleted documents won't be visible to your searcher/reader. Opening an IndexModifier is a heavy operation which should be done carefully, so you could close your modifier every n docs or after a certain idle time. Just be aware that IndexModifier uses IndexReader and IndexWriter internally, so if you do a delete after an addDocument, the IndexWriter will be closed and a new IndexReader will be opened. This is also a heavy operation in terms of performance.
You could keep your deletes until you want to flush your instance of IndexModifier to gain a bit of performance. > What is the right way to check and delete old write locks? You can use the IndexReader.unlock(Directory dir) method; the javadoc says: * Caution: this should only be used by failure recovery code, * when it is known that no other process nor thread is in fact * currently accessing this index. So make sure this happens only in recovery mode. best regards Simon
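Pulling that advice together, a hypothetical startup sequence against the Lucene 2.0 API might look like the following: clear a stale write lock only when recovering from a crash, open one long-lived IndexModifier, and register a shutdown hook so the lock is released on a clean exit. The class and method names here are assumptions for illustration, not prescribed API.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexModifier;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class IndexStartup {

    public static IndexModifier open(String path) throws Exception {
        Directory dir = FSDirectory.getDirectory(path, false);
        // Recovery only: safe solely because nothing else can be
        // using the index at startup time (see the javadoc caution).
        if (IndexReader.isLocked(dir)) {
            IndexReader.unlock(dir);
        }
        final IndexModifier modifier =
                new IndexModifier(dir, new StandardAnalyzer(), false);
        // Release the write lock on a clean JVM exit.
        Runtime.getRuntime().addShutdownHook(new Thread() {
            public void run() {
                try {
                    modifier.close();
                } catch (Exception ignored) {
                }
            }
        });
        return modifier;
    }
}
```

The singleton question then reduces to calling open() once and sharing the returned instance, flushing it every n documents or after an idle period as Simon suggests.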
Re: Singleton and IndexModifier
Ok, you've got me! ;) How do you ensure that your IndexModifier (or IndexWriter/IndexReader) is closed when your application ends? Or do you always call IndexReader.unlock(Directory dir) at startup time of your application? Thanks! lude On 8/21/06, Simon Willnauer <[EMAIL PROTECTED]> wrote: In the GData server I use a timed indexer that commits the modifications after a certain idle time or after n document inserts/updates/deletes. This ensures that your modifications will be available after a defined time. It also minimizes the opening and closing of readers and writers, as the deletes are done after all inserts. I really recommend keeping your searcher open as long as possible and keeping the opening and closing of index modifiers to a minimum. best regards Simon On 8/21/06, Erick Erickson <[EMAIL PROTECTED]> wrote: > My only caution is that as your index grows, the close/open of your readers may take more time than you are willing to spend. Not that I'm recommending against it, as I don't know the details, but it's something to keep an eye on. In my experience, "immediately available" may really mean "available after 10 minutes", so you can think about allowing a little latency to improve query speed if that becomes an issue. > Best > Erick
Re: Index-Format difference between 1.4.3 and 2.0
Hi Andrzej, a month ago you mentioned a new Lucene 2.0 compatible version of Luke. Does it exist somewhere? Thanks lude On 7/20/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote: lude wrote: >> As Luke was released with a Lucene-1.9 > Where did you get this information? From all I know, Luke is based on Lucene 1.4.3. The latest version of Luke was released with an early snapshot of 1.9. I plan to release a 2.0-based version in a few days. -- Best regards, Andrzej Bialecki <>< Information Retrieval, Semantic Web, Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
QueryParser returns all documents
Hello, a simple, stupid question for the friendly mailing list: how do you have to define the query string to get all documents of an index returned when using the QueryParser? In theory a query like 'NOT word_not_in_index' should find and return all documents; in practice this doesn't work (no documents are found). Greetings lude
Re: QueryParser returns all documents
"Why would you want to do this?" It is a feature request for our search engine: the user should have the possibility to query for all(!) documents, which would let him see every available document listed. Is there a simple way to define a query that returns all documents of an index? Thanks lude On 9/4/06, karl wettin <[EMAIL PROTECTED]> wrote: On Mon, 2006-09-04 at 22:32 +0200, lude wrote: > How do you have to define the query string to get all documents of an index returned when using the QueryParser? > In theory a query like 'NOT word_not_in_index' should find and return all documents. In practice this doesn't work (no documents are found). Why would you want to do this? Use IndexReader methods to iterate over all documents.
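Beside iterating with IndexReader, Lucene has shipped a MatchAllDocsQuery since around 1.9 that does exactly this. The classic QueryParser has no "match all" syntax, so the query is built programmatically; a sketch against the 2.0 search API (index path is a placeholder):

```java
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;

public class ListAll {
    public static void main(String[] args) throws Exception {
        IndexSearcher searcher = new IndexSearcher("/path/to/index");
        // Matches every non-deleted document, with a constant score.
        Hits hits = searcher.search(new MatchAllDocsQuery());
        System.out.println(hits.length() + " documents in the index");
        searcher.close();
    }
}
```

If the "all documents" option has to come through the same QueryParser pipeline as normal searches, one way is to detect the empty query string in the application and substitute the MatchAllDocsQuery there.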
Status: Sorting on tokenized fields
Hello, for years there has been discussion about making Lucene able to sort on TOKENIZED fields (e.g. if more than one term is present, concatenate the tokens, OR use the stored value for sorting). Could one of the developers please make a statement on whether there are any plans to implement this feature in the future? Thanks lude
Re: Status: Sorting on tokenized fields
Hi Chris, sure, you can create an additional field for every field that should support sorting. In our application we would need to do this for all 20 fields. That means we have to create twenty redundant fields just for sorting -- that's a real overhead in index size and indexing time. :: using the stored value doesn't help: there can be multiple stored values :: just as easily as there can be multiple tokens. In what situations do you have more than one stored value per field? In our applications we always have exactly one stored value for each field. In such a situation it would be perfect to use these stored values for sorting, wouldn't it? Greetings lude On 9/23/06, Chris Hostetter <[EMAIL PROTECTED]> wrote: : for years there has been discussion about making Lucene able to sort on TOKENIZED : fields. really? .. I've only been on the list since 1.4.3 but I don't remember it being much of a recurring topic. : (e.g. if more than one term is present, concatenate the tokens, OR use the : stored value for sorting). Using the stored value doesn't help: there can be multiple stored values just as easily as there can be multiple tokens. Concatenating the tokens is a vague concept that would be very hard to get right in a way that works generically: for starters, how do you deal with tokens at the same position? (i.e. synonyms) In my experience, the best way to deal with this is for the application using Lucene to decide which fields it wants to sort on, and make a "sortable" version of each such field that is indexed but not tokenized -- the application is, after all, in the best position to decide how exactly it wants to "sort" the data (e.g. should the values be lowercased so the sort is case-insensitive? should certain punctuation characters be stripped out? etc...) -Hoss
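Hoss's "sortable sibling field" is usually just a normalized copy of the stored value, indexed as a single untokenized term. A minimal normalizer in plain Java -- the rules here (ASCII lowercase, strip punctuation, collapse whitespace) are illustrative assumptions; each application would pick its own:

```java
public class SortKey {

    /** Build the value for a hypothetical "<field>_sort" sibling
     *  field. The result would be indexed untokenized (in Lucene 2.0:
     *  Field.Index.UN_TOKENIZED) so the whole string is one term
     *  that Lucene's sort code can use. Note the ASCII-only rule:
     *  accented characters are dropped, which may not be wanted. */
    public static String normalizeForSort(String storedValue) {
        String s = storedValue.toLowerCase();
        s = s.replaceAll("[^a-z0-9 ]", " ");   // drop punctuation
        s = s.replaceAll("\\s+", " ").trim();  // collapse whitespace
        return s;
    }
}
```

This keeps the case-insensitivity and punctuation questions Hoss raises under the application's control, at the cost of the redundant fields lude complains about.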