speeding up lucene search
Hello guys, What are some general techniques to make Lucene search faster? I'm thinking about splitting up the index. My current index has approx. 1.8 million documents (small documents) and the index size is about 550MB. Am I likely to get much gain out of splitting it up and using a ParallelMultiSearcher? Most of my queries search on 5-10 fields. Are there other things I should look at? Thanks to all, Anson - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
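Splitting the index and searching the shards in parallel might look roughly like this. This is only a sketch against the Lucene 1.4 API; the index paths and the "contents" field name are made up:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ParallelMultiSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searchable;
import org.apache.lucene.search.Searcher;

public class ShardedSearch {
    public static void main(String[] args) throws Exception {
        // One IndexSearcher per shard (hypothetical paths)
        Searchable[] shards = {
            new IndexSearcher("/indexes/part1"),
            new IndexSearcher("/indexes/part2")
        };
        // ParallelMultiSearcher runs each sub-search on its own thread
        Searcher searcher = new ParallelMultiSearcher(shards);
        Query query = QueryParser.parse("some words", "contents", new StandardAnalyzer());
        Hits hits = searcher.search(query);
        System.out.println(hits.length() + " hits");
        searcher.close();
    }
}
```

Whether this beats a single 550MB index depends on the hardware; on a single disk the gain may be small, since the sub-searches contend for the same I/O.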
Token or not Token, PerFieldAnalyzer
I still don't understand something: my analyzer contains a tokenizer, turning "hello world" into [hello] [world]. Is this analyzer applied to non-tokenized fields? What exactly is done to a field when its tokenize boolean is set to true? -- Florian
Sorting on tokenized fields
I see in the Javadoc that it is only possible to sort on fields that are not tokenized. I have two questions about that: 1) What happens if the field is tokenized? Is sorting done anyway, using the first term only? 2) Is there a way to do some sorting anyway, by concatenating all the tokens into one string? -- Florian
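A common workaround (a sketch, not an answer from this thread) is to index a second, untokenized copy of the field purely for sorting, since Lucene 1.4's Sort only works on untokenized fields. The field names here are invented:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.search.Sort;

public class SortableField {
    // Build a document with a tokenized field for searching
    // and an untokenized twin used only for sorting.
    public static Document makeDoc(String title) {
        Document doc = new Document();
        doc.add(Field.Text("title", title));        // tokenized, searchable
        doc.add(Field.Keyword("titleSort", title)); // untokenized, sortable
        return doc;
    }

    // At search time (searcher and query assumed to exist):
    // Hits hits = searcher.search(query, new Sort("titleSort"));
}
```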
Re: Lucene vs. MySQL Full-Text
On Jul 20, 2004, at 12:29 PM, Tim Brennan wrote: Someone came into my office today and asked me about the project I am trying to use Lucene for -- "why aren't you just using a MySQL full-text index to do that?" After thinking about it for a few minutes, I realized I don't have a great answer. MySQL builds inverted indexes for (in theory) doing the same type of lookup that Lucene does. You'd maybe have to build some kind of layer on the front to mimic Lucene's analyzers, but that wouldn't be too hard. My only experience with MySQL full-text is trivial test apps -- but the MySQL world does have some significant advantages (it's a known quantity from an operations perspective, etc.). Does anyone out there have anything more concrete they can add? --tim I'd say that MySQL full-text is much slower if you have a lot of data... that is one of the reasons we started using Lucene (we had a MySQL db to do the search); it's way faster! -- Florian
Re: Lucene vs. MySQL Full-Text
On Tuesday 20 July 2004 21:29, Tim Brennan wrote: > Does anyone out there have > anything more concrete they can add? Stemming is still on the MySQL TODO list: http://dev.mysql.com/doc/mysql/en/Fulltext_TODO.html Also, for most people it's easier to extend Lucene than MySQL (as MySQL is written in C(++?)), and there are more powerful queries in Lucene, e.g. fuzzy phrase search. Regards Daniel -- http://www.danielnaber.de
Re: Tokenizers and java.text.BreakIterator
Answering my own question: I think it is because Tokenizers work with a Reader, and you would have to read in the whole document in order to use the BreakIterator, which operates on a String... >>> [EMAIL PROTECTED] 07/20/04 03:23PM >>> Hi, Was wondering if anyone uses java.text.BreakIterator#getWordInstance(Locale) as a tokenizer for various languages? Does it do a good job? It seems like it does, at least for languages where words are separated by spaces or punctuation, but I have only done simple tests. Anyone have any thoughts on this? What am I missing? Does this seem like a valid approach? Thanks, Grant
Limiting Term Queries
Is it possible to limit a term query? For example: I am indexing documents with (amongst other things) a string in one field and a number in another field. All combinations of strings and numbers are allowed and neither field is unique. I would like a way to query Lucene to pull out all unique numbers for a specific letter. If I had: (a, 123) (b, 123) (a, 123) (b, 23) (a, 45) I would want a way to pull all unique numbers such that the string is 'a': (a, 123) (a, 45) Right now I am determining the unique numbers by performing a term query: TermEnum enumerator = reader.terms(new Term(number_field, "")); where reader is an IndexReader and number_field is the field containing the number. This gives me a list of all unique numbers, but counts those documents that might have different letters (i.e. not just 'a'). Any thoughts on this? Shawn.
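If the number field is stored, one possible approach (a sketch; the field names follow the example above and are otherwise invented) is to query on the letter and collect the distinct numbers from the hits, instead of enumerating every term in the index:

```java
import java.util.HashSet;
import java.util.Set;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class UniqueNumbers {
    // Collect the distinct values of number_field across all documents
    // whose letter_field equals the given letter (e.g. "a").
    public static Set uniqueNumbers(IndexSearcher searcher, String letter)
            throws Exception {
        Hits hits = searcher.search(new TermQuery(new Term("letter_field", letter)));
        Set unique = new HashSet();
        for (int i = 0; i < hits.length(); i++) {
            unique.add(hits.doc(i).get("number_field"));
        }
        return unique;
    }
}
```

This trades the TermEnum scan for retrieving the stored field of every matching document, so it is only attractive when the hit count per letter is manageable.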
Re: lucene customized indexing
It seems to me the answer to this is not necessarily to open up the API, but to provide a mechanism for adding Writers and Readers to the indexing/searching process at the application level. These readers and writers could be passed to Lucene and used to read and write to separate files (thus not harming the index file format). They could be used to read/write an arbitrary amount of metadata at the term, document and/or index level without affecting the core Lucene index. Furthermore, previous versions could still work because they would just ignore the new files, and the indexes could be used by other applications as well. This is just a thought in the infancy stage, but it seems like it would solve the problem. Of course, the trick is figuring out how it fits into the API (or maybe it becomes a part of 2.0). Not sure if it is even feasible, but it seems like you could define interfaces for Readers and Writers that met the requirements to do this. This may be better discussed on the dev list. >>> [EMAIL PROTECTED] 07/20/04 11:28AM >>> Hi: I am trying to store some database-like field values into Lucene. I have my own way of storing field values in a customized format. I guess my question is whether we can make the Reader/Writer classes, e.g. FieldReader, FieldWriter, DocumentReader/Writer, non-final? I have asked to make the Lucene API less restrictive many, many times but got no replies. Is this request feasible? Thanks -John
Lucene vs. MySQL Full-Text
Someone came into my office today and asked me about the project I am trying to use Lucene for -- "why aren't you just using a MySQL full-text index to do that?" After thinking about it for a few minutes, I realized I don't have a great answer. MySQL builds inverted indexes for (in theory) doing the same type of lookup that Lucene does. You'd maybe have to build some kind of layer on the front to mimic Lucene's analyzers, but that wouldn't be too hard. My only experience with MySQL full-text is trivial test apps -- but the MySQL world does have some significant advantages (it's a known quantity from an operations perspective, etc.). Does anyone out there have anything more concrete they can add? --tim
Tokenizers and java.text.BreakIterator
Hi, Was wondering if anyone uses java.text.BreakIterator#getWordInstance(Locale) as a tokenizer for various languages? Does it do a good job? It seems like it does, at least for languages where words are separated by spaces or punctuation, but I have only done simple tests. Anyone have any thoughts on this? What am I missing? Does this seem like a valid approach? Thanks, Grant
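For what it's worth, a quick standalone experiment with BreakIterator (plain java.text, no Lucene involved) looks like this:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class BreakIteratorDemo {
    // Return the word tokens of text, skipping whitespace and punctuation spans.
    static List tokenize(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List tokens = new ArrayList();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String token = text.substring(start, end);
            // keep only spans that begin with a letter or digit
            if (token.length() > 0 && Character.isLetterOrDigit(token.charAt(0))) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello, world. 42!", Locale.US));
        // prints [Hello, world, 42]
    }
}
```

As the follow-up notes, wrapping this in a Lucene Tokenizer is awkward because Tokenizers consume a Reader incrementally while BreakIterator wants the whole String up front.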
Re: lucene customized indexing
On Jul 20, 2004, at 2:10 PM, John Wang wrote: I have already provided my opinion on this one - I think it would be fine to allow Token to be public. I'll let others respond to the additional requests you've made. Great, what processes need to be in place before this gets in the code base? You're doing the right thing, although codebase details are most appropriate for the lucene-dev list. And filing issues in Bugzilla ensures your requests do not get lost in e-mail inboxes. At this point, Lucene 1.4 has been released and Doug has put forth a proposal for Lucene 2.0 (with a migration path of a version 1.9 intermediate release). I'm not sure when the best time is to make this change. We should put API changes to a VOTE on the lucene-dev list though. In fact, I'll post a VOTE for Token now! :) Then they should speak up :) Well, I AM speaking up. So have some other people in earlier emails. But, like me, they are getting ignored. You are not being ignored - not at all. Look at the replies you've gotten already. The HayStack changes were needed specifically due to the fact that many classes are declared to be final and not extensible. Did they post their changes back? Did they discuss them here? I do not recall such discussions (although see above about being lost in e-mail inboxes - mine is swamped beyond belief). Are there Bugzilla issues with their patches? Making things extensible for no good reason is asking for maintenance troubles later when you need more control internally. Lucene has been well designed from the start, with extensibility only where it was needed. It has evolved to be more open in very specific areas after the performance impact was carefully weighed. "Breaking" is not really the concern with extensibility, I don't think. Real-world use cases are needed to show that changes need to be made. I thought I gave many "real-world use cases" in the previous email. And evidently they also apply to the HayStack project.
What other information do we need to provide? I was not referring to your requests in my comment, but rather making a general comment regarding requests to make things "public" when quite sufficient alternatives exist. Erik
Re: lucene customized indexing
That is exactly what they did, and that's probably what I will have to do. But that means we are diverging from the Lucene code base, and future fixes and enhancements will need to be synchronized, which may be a pain. -John On Tue, 20 Jul 2004 20:03:05 +0200, Daniel Naber <[EMAIL PROTECTED]> wrote: > On Tuesday 20 July 2004 18:12, John Wang wrote: > > > They make sure during deployment their "versions" > > gets loaded before the same classes in the lucene .jar. > > I don't see why people cannot just make their own lucene.jar. Just remove > the "final" and recompile. Finally, Lucene is Open Source. > > Regards > Daniel > > -- > http://www.danielnaber.de
Re: lucene customized indexing
On Tue, 20 Jul 2004 13:40:28 -0400, Erik Hatcher <[EMAIL PROTECTED]> wrote: > On Jul 20, 2004, at 12:12 PM, John Wang wrote: > > There are a few things I want to do to be able to customize Lucene: > > > [...] > > > > 3) to be able to customize analyzers to add more information to the > > Token while doing tokenization. > > I have already provided my opinion on this one - I think it would be > fine to allow Token to be public. I'll let others respond to the > additional requests you've made. Great, what processes need to be in place before this gets in the code base? > > > Oleg mentioned the HayStack project. In the HayStack source > > code, they had to modify many Lucene classes to make them non-final in > > order to customize. They make sure during deployment their "versions" > > get loaded before the same classes in the lucene .jar. It is > > cumbersome, but it is a Lucene restriction they had to live with. > > Wow - I didn't realize that they've made local changes. Did they post > with requests for opening things up as you have? Did they submit > patches with their local changes? > > > I believe there are many other users who feel the same way. > > Then they should speak up :) Well, I AM speaking up. So have some other people in earlier emails. But, like me, they are getting ignored. The HayStack changes were needed specifically due to the fact that many classes are declared to be final and not extensible. > > > If I write some classes that derive from the Lucene API and it > > breaks, then it is my responsibility to fix it. I don't understand why > > it would add burden to the Lucene developers. > > Making things extensible for no good reason is asking for maintenance > troubles later when you need more control internally. Lucene has been > well designed from the start with extensibility only where it was > needed in mind. It has evolved to be more open in very specific areas > after careful consideration of the performance impact has been weighed.
> "Breaking" is not really the concern with extensibility, I don't > think. Real-world use cases are needed to show that changes need to be > made. I thought I gave many "real-world use cases" in the previous email. And evidently they also apply to the HayStack project. What other information do we need to provide? I don't want to diverge from the Lucene codebase like HayStack has done, but I may not have a choice. Thanks -John > > Erik
Re: join two indexes
On Tuesday 20 July 2004 19:19, Sergio wrote: > i want to join two lucene indexes but i dont know how to do that. There are two "addIndexes" methods in IndexWriter which you can use to write your own small merge tool (a ready-to-use tool for index merging doesn't exist AFAIK). Regards Daniel -- http://www.danielnaber.de
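Such a small merge tool built on addIndexes might look like this (a sketch against the Lucene 1.4 API; destination and source paths come from the command line):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class MergeIndexes {
    // Usage: java MergeIndexes <destination> <source1> <source2> ...
    public static void main(String[] args) throws Exception {
        // true = create a fresh index at the destination
        IndexWriter writer = new IndexWriter(args[0], new StandardAnalyzer(), true);
        Directory[] sources = new Directory[args.length - 1];
        for (int i = 1; i < args.length; i++) {
            sources[i - 1] = FSDirectory.getDirectory(args[i], false);
        }
        writer.addIndexes(sources); // merges all sources into the destination
        writer.close();
    }
}
```

Note this is a physical merge of segments; it does not "join" documents across indexes on a shared key like studentId, which is what the original question asks about and which Lucene does not do for you.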
Re: Very slow IndexReader.open() performance
Optimization should not require huge amounts of memory. Can you tell a bit more about your configuration: What JVM? What OS? How many fields? What mergeFactor have you used? Also, please attach the output of 'ls -l' of your index directory, as well as the stack trace you see when OutOfMemory is thrown. Thanks, Doug Mark Florence wrote: Hi -- We have a large index (~4m documents, ~14gb) that we haven't been able to optimize for some time, because the JVM throws OutOfMemory after climbing to the maximum we can throw at it, 2gb. In fact, the OutOfMemory condition occurred most recently during a segment merge operation. maxMergeDocs was set to the default, and we seem to have gotten around this problem by setting it to some lower value, currently 100,000. The index is highly interactive, so I took the hint from earlier posts to set it to this value. Good news! No more OutOfMemory conditions. Bad news: now, calling IndexReader.open() is taking 20+ seconds, and it is killing performance. I followed the design pattern in another earlier post from Doug. I take a batch of deletes, open an IndexReader, perform the deletes, then close it. Then I take a batch of adds, open an IndexWriter, perform the adds, then close it. Then I get a new IndexSearcher for searching. But because the index is so interactive, this sequence repeats itself all the time. My question is, is there a better way? Performance was fine when I could optimize. Can I hold onto a singleton IndexReader/IndexWriter/IndexSearcher to avoid the overhead of the open? Any help would be most gratefully received. Mark Florence, CTO, AIRS [EMAIL PROTECTED] 800-897-7714x1703
Re: lucene customized indexing
On Tuesday 20 July 2004 18:12, John Wang wrote: > They make sure during deployment their "versions" > gets loaded before the same classes in the lucene .jar. I don't see why people cannot just make their own lucene.jar. Just remove the "final" and recompile. Finally, Lucene is Open Source. Regards Daniel -- http://www.danielnaber.de
Very slow IndexReader.open() performance
Hi -- We have a large index (~4m documents, ~14gb) that we haven't been able to optimize for some time, because the JVM throws OutOfMemory after climbing to the maximum we can throw at it, 2gb. In fact, the OutOfMemory condition occurred most recently during a segment merge operation. maxMergeDocs was set to the default, and we seem to have gotten around this problem by setting it to some lower value, currently 100,000. The index is highly interactive, so I took the hint from earlier posts to set it to this value. Good news! No more OutOfMemory conditions. Bad news: now, calling IndexReader.open() is taking 20+ seconds, and it is killing performance. I followed the design pattern in another earlier post from Doug. I take a batch of deletes, open an IndexReader, perform the deletes, then close it. Then I take a batch of adds, open an IndexWriter, perform the adds, then close it. Then I get a new IndexSearcher for searching. But because the index is so interactive, this sequence repeats itself all the time. My question is, is there a better way? Performance was fine when I could optimize. Can I hold onto a singleton IndexReader/IndexWriter/IndexSearcher to avoid the overhead of the open? Any help would be most gratefully received. Mark Florence, CTO, AIRS [EMAIL PROTECTED] 800-897-7714x1703
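On the singleton question: one way to avoid reopening on every search (a sketch, not a reply from the thread; the index path is made up) is to cache a single IndexSearcher and replace it only after a write batch completes. An open searcher keeps working against the point-in-time index it was opened on; it just won't see updates until it is replaced:

```java
import java.io.IOException;
import org.apache.lucene.search.IndexSearcher;

public class SearcherCache {
    private static IndexSearcher searcher;
    private static final String INDEX_PATH = "/path/to/index"; // hypothetical

    // All searches share one IndexSearcher, paying the open cost once
    // per write batch instead of once per query.
    public static synchronized IndexSearcher getSearcher() throws IOException {
        if (searcher == null) {
            searcher = new IndexSearcher(INDEX_PATH);
        }
        return searcher;
    }

    // Call after a delete/add batch finishes so the next search sees the changes.
    public static synchronized void refresh() throws IOException {
        if (searcher != null) {
            searcher.close();
            searcher = null;
        }
    }
}
```

The delete/add batching itself still needs its own IndexReader and IndexWriter in alternation, since in this Lucene version a reader's deletes and a writer's adds cannot be interleaved on the same index.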
Re: lucene customized indexing
On Jul 20, 2004, at 12:12 PM, John Wang wrote: There are a few things I want to do to be able to customize Lucene: [...] 3) to be able to customize analyzers to add more information to the Token while doing tokenization. I have already provided my opinion on this one - I think it would be fine to allow Token to be public. I'll let others respond to the additional requests you've made. Oleg mentioned the HayStack project. In the HayStack source code, they had to modify many Lucene classes to make them non-final in order to customize. They make sure during deployment their "versions" get loaded before the same classes in the lucene .jar. It is cumbersome, but it is a Lucene restriction they had to live with. Wow - I didn't realize that they've made local changes. Did they post with requests for opening things up as you have? Did they submit patches with their local changes? I believe there are many other users who feel the same way. Then they should speak up :) If I write some classes that derive from the Lucene API and it breaks, then it is my responsibility to fix it. I don't understand why it would add burden to the Lucene developers. Making things extensible for no good reason is asking for maintenance troubles later when you need more control internally. Lucene has been well designed from the start, with extensibility only where it was needed. It has evolved to be more open in very specific areas after the performance impact was carefully weighed. "Breaking" is not really the concern with extensibility, I don't think. Real-world use cases are needed to show that changes need to be made. Erik
Here is how to search multiple indexes
Here is the code that I use to do multi-index searches:

// create a multi-index searcher
// n is the number of indexes to search
IndexSearcher[] indexes = new IndexSearcher[n];
for (int i = 0; i < n; i++) {
    // use whichever IndexSearcher constructor you want;
    // blah is the appropriate value to pass
    indexes[i] = new IndexSearcher(blah);
}

// This is the part which allows you to search multiple indexes
Searcher searcher = new MultiSearcher(indexes);

// do the search
Analyzer analyzer = new StandardAnalyzer();
Query query = QueryParser.parse(expression, colSearch, analyzer);
searcher.search(query);

At 01:19 PM 20/07/2004, you wrote: Hi, i want to join two lucene indexes but i don't know how to do that. For example i have a student index and a school index. In the school index i have the studentId field. How to do that? Any idea will be welcomed. Thx, Sergio. Don Vaillancourt Director of Software Development WEB IMPACT INC. 416-815-2000 ext. 245 email: [EMAIL PROTECTED] web: http://www.web-impact.com
Syntax of Query
Hey guys, Need some help with creating a query. Here is the scenario: Field 1: Field 2: Field 3: MultiSelect 1: MultiSelect 2: What would the query look like if the condition is that at any time there will be one entry from Field 1, 2, or 3, a few entries from MultiSelect 1, and a few entries from MultiSelect 2? Would it look something like +field1 +(val11 OR val12 OR val14) +(val21 OR val23 OR val24)? Thanks for all you guys' support. -H
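Programmatically, that kind of combination can be built with BooleanQuery. This is a sketch with invented field and value names; in the 1.4 API, add(query, true, false) marks a clause as required:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class FormQueryBuilder {
    // Builds: +field1:val1 +(multi1:val11 multi1:val12) +(multi2:val21 multi2:val23)
    public static Query build() {
        BooleanQuery query = new BooleanQuery();
        // the single-field entry: required
        query.add(new TermQuery(new Term("field1", "val1")), true, false);

        // at least one of the MultiSelect 1 values must match
        BooleanQuery multi1 = new BooleanQuery();
        multi1.add(new TermQuery(new Term("multi1", "val11")), false, false);
        multi1.add(new TermQuery(new Term("multi1", "val12")), false, false);
        query.add(multi1, true, false);

        // at least one of the MultiSelect 2 values must match
        BooleanQuery multi2 = new BooleanQuery();
        multi2.add(new TermQuery(new Term("multi2", "val21")), false, false);
        multi2.add(new TermQuery(new Term("multi2", "val23")), false, false);
        query.add(multi2, true, false);
        return query;
    }
}
```

Wrapping each group of optional clauses in its own required BooleanQuery is what gives the "+(a OR b OR c)" behavior sketched in the question.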
join two indexes
Hi, i want to join two lucene indexes but i don't know how to do that. For example i have a student index and a school index. In the school index i have the studentId field. How to do that? Any idea will be welcomed. Thx, Sergio.
Re: No change in the indexing time after increase the merge factor
All Lucene articles that I know of were written before IndexWriter.minMergeDocs was added. Check the IndexWriter javadoc for more info, but this is another field you can tune. Otis --- Praveen Peddi <[EMAIL PROTECTED]> wrote: > I performed lucene indexing with 25,000 documents. > We feel that indexing is slow, so I am trying to tune it. > My configuration is as follows: > Machine: Windows XP, 1GB RAM, 3GHz > # of documents: 25,000 > App Server: Weblogic 7.0 > lucene version: lucene 1.4 final > > I ran the indexer with a merge factor of 10 and 50. Both times, the > total indexing time (lucene time only) is almost the same (27.92 mins > for mergeFactor=10 and 28.11 mins for mergeFactor=50). > > From the lucene mails and lucene-related articles I read, I thought > increasing the merge factor would improve the performance of indexing. Am > I wrong? > > Praveen > > ** > Praveen Peddi > Sr Software Engg, Context Media, Inc. > email: [EMAIL PROTECTED] > Tel: 401.854.3475 > Fax: 401.861.3596 > web: http://www.contextmedia.com > ** > Context Media - "The Leader in Enterprise Content Integration"
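In Lucene 1.4 these knobs are public fields on IndexWriter, so the tuning looks roughly like this (a sketch; the path is made up and the values are illustrative, not recommendations):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class TunedIndexing {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("/path/to/index",
                new StandardAnalyzer(), true);
        writer.mergeFactor = 50;    // how many segments accumulate before a merge
        writer.minMergeDocs = 1000; // how many docs are buffered in RAM before flushing
        // ... writer.addDocument(doc) calls ...
        writer.close();
    }
}
```

Raising minMergeDocs keeps more documents in RAM between flushes, which often matters more for raw indexing throughput than mergeFactor alone, at the cost of memory.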
Re: lucene customized indexing
Hi Daniel: There are a few things I want to do to be able to customize Lucene: 1) to be able to plug in a different similarity model (e.g. Bayesian, vector space etc.) 2) to be able to store certain fields in their own format and provide corresponding readers. I may not want to store every field in the lexicon/inverted index structure. I may have fields where it doesn't make sense to store the position or frequency information. 3) to be able to customize analyzers to add more information to the Token while doing tokenization. Oleg mentioned the HayStack project. In the HayStack source code, they had to modify many Lucene classes to make them non-final in order to customize. They make sure during deployment their "versions" get loaded before the same classes in the lucene .jar. It is cumbersome, but it is a Lucene restriction they had to live with. I believe there are many other users who feel the same way. If I write some classes that derive from the Lucene API and it breaks, then it is my responsibility to fix it. I don't understand why it would add burden to the Lucene developers. Thanks -John On Tue, 20 Jul 2004 17:56:26 +0200, Daniel Naber <[EMAIL PROTECTED]> wrote: > On Tuesday 20 July 2004 17:28, John Wang wrote: > > > I have asked to make the Lucene API less restrictive many many many > > times but got no replies. > > I suggest you just change it in your source and see if it works. Then you can > still explain what exactly you did and why it's useful. From the developers' > point of view, having things non-final means more stuff is exposed and making > changes is more difficult (unless one accepts that derived classes may break > with the next update). > > Regards > Daniel
Re: Post-sorted inverted index?
You can define a subclass of FilterIndexReader that re-sorts documents in TermPositions(Term) and document(int), then use IndexWriter.addIndexes() to write this in Lucene's standard format. I have done this in Nutch, with the (as yet unused) IndexOptimizer. http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/src/java/net/nutch/indexer/IndexOptimizer.java?view=markup Doug Aphinyanaphongs, Yindalon wrote: I gather from reading the documentation that the scores for each document hit are computed at query time. I have an application that, due to the complexity of the function, cannot compute scores at query time. Would it be possible for me to store the documents in pre-sorted order in the inverted index? (i.e. after the initial index is created, to have a post-processing step to sort and reindex the final documents). For example: Document A - score 0.2 Document B - score 0.4 Document C - score 0.6 Thus for the word 'the', the stored order in the index would be C, B, A. Thanks!
Re: lucene customized indexing
On Tuesday 20 July 2004 17:28, John Wang wrote: > I have asked to make the Lucene API less restrictive many many many > times but got no replies. I suggest you just change it in your source and see if it works. Then you can still explain what exactly you did and why it's useful. From the developers' point of view, having things non-final means more stuff is exposed and making changes is more difficult (unless one accepts that derived classes may break with the next update). Regards Daniel
No change in the indexing time after increase the merge factor
I performed lucene indexing with 25,000 documents. We feel that indexing is slow, so I am trying to tune it. My configuration is as follows: Machine: Windows XP, 1GB RAM, 3GHz # of documents: 25,000 App Server: Weblogic 7.0 lucene version: lucene 1.4 final I ran the indexer with a merge factor of 10 and 50. Both times, the total indexing time (lucene time only) is almost the same (27.92 mins for mergeFactor=10 and 28.11 mins for mergeFactor=50). From the lucene mails and lucene-related articles I read, I thought increasing the merge factor would improve the performance of indexing. Am I wrong? Praveen ** Praveen Peddi Sr Software Engg, Context Media, Inc. email: [EMAIL PROTECTED] Tel: 401.854.3475 Fax: 401.861.3596 web: http://www.contextmedia.com ** Context Media - "The Leader in Enterprise Content Integration"
lucene customized indexing
Hi: I am trying to store some database-like field values into Lucene. I have my own way of storing field values in a customized format. I guess my question is whether we can make the Reader/Writer classes, e.g. FieldReader, FieldWriter, DocumentReader/Writer, non-final? I have asked to make the Lucene API less restrictive many, many times but got no replies. Is this request feasible? Thanks -John
Re: The indexer
On Jul 20, 2004, at 10:07 AM, Ian McDonnell wrote: As for indexing data from mysql - there have been lots of discussions of that recently, so check the archives. Basically you read the data, and index it with Lucene's API. And you are responsible for keeping it in sync. The problem i am having is reading the data from the sql tables and then using the indexer to store it. Has anybody indexed from a mysql table before? If so, do i need to create some kind of JDBC query that selects all the field values from the table and indexes them in a lucene document that is stored on the server? If i do this, how can this process be automated rather than manually running the program every time a new profile is added via the jsp form? How you get the data from your database is really up to you. Some folks here may be able to offer some advice, but ultimately it is specific to your application and business process. Once you have the data, via some query (again, how you do it is up to you), you use Lucene's IndexWriter: create new Documents, add Fields to them, add each document to the writer, then close the writer. That's all there is to indexing a document with Lucene. As for automation - again this is up to your application, but certainly you can interact with a Lucene index from your application so that it is not a manual, separate indexing step. Erik, i'm not sure what you mean about keeping the db in sync. Are you talking about stale or updated db entries? You need to ensure that when data changes, the index is updated to reflect those changes. Erik
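Concretely, the read-then-index loop Erik describes might look like this. It is only a sketch: the JDBC URL, credentials, table, and column names are all invented, and the MySQL driver jar must be on the classpath:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class ClipIndexer {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/clips", "user", "password");
        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery("SELECT id, title, description FROM clip");

        // true = create a new index, replacing any existing one
        IndexWriter writer = new IndexWriter("/path/to/index",
                new StandardAnalyzer(), true);
        while (rs.next()) {
            Document doc = new Document();
            doc.add(Field.Keyword("id", rs.getString("id")));    // stored, untokenized
            doc.add(Field.Text("title", rs.getString("title"))); // stored, tokenized
            doc.add(Field.Text("description", rs.getString("description")));
            writer.addDocument(doc);
        }
        writer.optimize();
        writer.close();
        conn.close();
    }
}
```

For the automation question, the same addDocument logic can be called directly from the JSP form's submit handler, right after the database insert, instead of rebuilding the whole index in a batch.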
Re: The indexer
On Jul 20, 2004, at 9:29 AM, Ian McDonnell wrote: Basically i add details about a movie clip as various fields in an sql db using a jsp form. When the form submits i want to add the details into the db and also want the fields to be stored as a searchable lucene index on the server. Is this possible? Of course. But you'll have to code it. It's only a few lines of code to index a "document" into a Lucene index, but it is up to you to code those into the appropriate spot in your system (most likely right where you insert into mysql). Erik Ian --- Erik Hatcher <[EMAIL PROTECTED]> wrote: On Jul 20, 2004, at 8:44 AM, Ian McDonnell wrote: Can Lucenes indexer be used to store info in fields in a mysql db? I'm not quite clear on your question. You want to store a Lucene index (aka Directory) within mysql? Or, you want to index data from your existing mysql database into a Lucene index? A Directory implementation for Berkeley DB was created by the Chandler project and contributed to the Lucene sandbox (see Lucene's website for details on the sandbox and how to get to it). There have been some efforts to put a Lucene index into SQL Server, I believe, but I haven't seen mention of that in a while. It *can* be done, but I'm skeptical of the performance hit of adding in a relational database layer - and to do it well would certainly be non-trivial. As for indexing data from mysql - there have been lots of discussions of that recently, so check the archives. Basically you read the data, and index it with Lucene's API. And you are responsible for keeping it in sync. Erik _ Sign up for FREE email from SpinnersCity Online Dance Magazine & Vortal at http://www.spinnerscity.com
Re: The indexer
Yeah, that last part of your reply seems to be what I'm trying to do (you'll have to excuse me, as I'm a total newbie to Lucene and am only finding my feet with it). I searched the archives and went back through them manually just there, but didn't find any relevant posts.

> As for indexing data from mysql - there have been lots of discussions
> of that recently, so check the archives. Basically you read the data,
> and index it with Lucene's API. And you are responsible for keeping it
> in sync.

The problem I am having is reading the data from the SQL tables and then using the indexer to store it. Has anybody indexed from a MySQL table before? If so, do I need to create some kind of JDBC query that selects all the field values from the table and indexes them in a Lucene document that is stored on the server? If I do this, how can the process be automated, rather than manually running the program every time a new profile is added via the JSP form?

Erik, I'm not sure what you mean about keeping the db in sync. Are you talking about stale or updated db entries?

Ian
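A hedged sketch (untested) of the JDBC approach Ian asks about: one SELECT over the table, one Lucene `Document` per row, using the 1.4-era API. The JDBC URL, credentials, table and column names are all invented for illustration; a real run also needs the MySQL JDBC driver on the classpath.

```java
import java.sql.*;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class BulkIndexer {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/clips", "user", "password");
        // true = create a fresh index, replacing any existing one
        IndexWriter writer = new IndexWriter("/path/to/index",
                                             new StandardAnalyzer(), true);
        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery(
                "SELECT id, title, description FROM clips");
        while (rs.next()) {
            Document doc = new Document();
            doc.add(Field.Keyword("id", rs.getString("id")));
            doc.add(Field.Text("title", rs.getString("title")));
            doc.add(Field.Text("description", rs.getString("description")));
            writer.addDocument(doc);
        }
        writer.optimize(); // merge segments for faster searching
        writer.close();
        conn.close();
    }
}
```

On the automation question: rather than re-running a bulk job, the usual answer is to add each record to the index at the moment the JSP form inserts it into the db, so the two stay in sync without a scheduled rebuild.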
Re: The indexer
Basically I add details about a movie clip as various fields in an SQL db using a JSP form. When the form submits, I want to add the details into the db and also want the fields to be stored as a searchable Lucene index on the server. Is this possible?

Ian

--- Erik Hatcher <[EMAIL PROTECTED]> wrote:
> [full reply quoted]
Re: The indexer
On Jul 20, 2004, at 8:44 AM, Ian McDonnell wrote:

> Can Lucene's indexer be used to store info in fields in a mysql db?

I'm not quite clear on your question. You want to store a Lucene index (aka Directory) within mysql? Or, you want to index data from your existing mysql database into a Lucene index?

A Directory implementation for Berkeley DB was created by the Chandler project and contributed to the Lucene sandbox (see Lucene's website for details on the sandbox and how to get to it). There have been some efforts to put a Lucene index into SQL Server, I believe, but I haven't seen mention of that in a while. It *can* be done, but I'm skeptical of the performance hit of adding in a relational database layer - and to do it well would certainly be non-trivial.

As for indexing data from mysql - there have been lots of discussions of that recently, so check the archives. Basically you read the data, and index it with Lucene's API. And you are responsible for keeping it in sync.

Erik
The indexer
Can Lucene's indexer be used to store info in fields in a mysql db? If so, can anybody point me to an example or some documentation relating to it?

Ian
Re: Post-sorted inverted index?
On Jul 20, 2004, at 1:27 AM, Aphinyanaphongs, Yindalon wrote:

> I gather from reading the documentation that the scores for each document hit are computed at query time. I have an application that, due to the complexity of the function, cannot compute scores at query time. Would it be possible for me to store the documents in pre-sorted order in the inverted index? (i.e., after the initial index is created, to have a post-processing step to sort and reindex the final documents). For example:
>
> Document A - score 0.2
> Document B - score 0.4
> Document C - score 0.6
>
> Thus for the word 'the', the stored order in the index would be C, B, A.

Lucene 1.4 includes a Sort facility - look at the additional IndexSearcher.search() methods for details. By default, if the computed scores are identical, the results are then ordered by document id, which is the insertion order. I hope this helps.

Erik
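An untested sketch of the Lucene 1.4 Sort facility Erik points to, applied to the original question: store the pre-computed score in its own field at index time, then sort on that field at search time instead of relying on query-time scoring. The field names ("contents", "rank") are examples, and the `SortField.FLOAT` constant is assumed from the 1.4 API.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;

public class PreSortedSearch {
    public static Hits search(IndexSearcher searcher)
            throws java.io.IOException {
        Query query = new TermQuery(new Term("contents", "the"));
        // Sort by the stored "rank" field, highest score first (reverse = true).
        // With the example scores, C (0.6) comes back before B (0.4) before A (0.2).
        Sort sort = new Sort(new SortField("rank", SortField.FLOAT, true));
        return searcher.search(query, sort);
    }
}
```

This avoids a reindexing/post-processing step entirely: document order in the index no longer matters, because the sort field determines the result order.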
Re: Query across multiple fields scenario not handled by "MultiFieldQueryParser"
Daniel,

> > Does anybody here know which changes I
> > would have to make to QueryParser.jj to get the functionality described?
>
> I haven't tried it, but I guess you need to change the getXXXQuery() methods so
> they return a BooleanQuery. For example, getFieldQuery currently might return
> a TermQuery; you'll need to change that so it returns a BooleanQuery with two
> TermQuerys. These two queries would have the same term, but a different
> field.
>
> Another approach is to leave QueryParser alone and modify the query after it
> has been parsed, by recursively iterating over the parsed query and replacing
> e.g. TermQuerys with BooleanQuerys (just like described above).

Many thanks for your advice. Although I was hoping not to have to implement the change (as it has apparently been done), I guess this is enough to get me going.

Thomas
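A hedged sketch (untested) of Daniel's second suggestion: leave QueryParser alone and recursively rewrite the parsed query, replacing each TermQuery with a BooleanQuery that ORs the same term across several fields. It assumes the 1.4-era API, where `BooleanQuery.add` takes `(query, required, prohibited)` booleans and `BooleanClause` exposes public `query`/`required`/`prohibited` fields; the field names passed in are up to the caller.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;

public class FieldExpander {
    private final String[] fields;

    public FieldExpander(String[] fields) { this.fields = fields; }

    public Query expand(Query q) {
        if (q instanceof TermQuery) {
            Term t = ((TermQuery) q).getTerm();
            BooleanQuery or = new BooleanQuery();
            for (int i = 0; i < fields.length; i++) {
                // same term text, different field; neither required nor prohibited
                or.add(new TermQuery(new Term(fields[i], t.text())), false, false);
            }
            return or;
        }
        if (q instanceof BooleanQuery) {
            // rebuild the BooleanQuery, expanding each clause recursively
            BooleanClause[] clauses = ((BooleanQuery) q).getClauses();
            BooleanQuery out = new BooleanQuery();
            for (int i = 0; i < clauses.length; i++) {
                out.add(expand(clauses[i].query),
                        clauses[i].required, clauses[i].prohibited);
            }
            return out;
        }
        return q; // other query types (phrase, wildcard, ...) left as-is here
    }
}
```

For the common cases this is equivalent to what MultiFieldQueryParser produces, but the post-parse rewrite preserves whatever operator structure the user typed.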