MergerIndex + Searchables
Hi Guys Apologies... I have several merged indexes [MGR1, MGR2, MGR3]. For searching across these merged indexes I use the following code:

    IndexSearcher[] indexToSearch = new IndexSearcher[CNTINDXDBOOK];
    for (int all = 0; all < CNTINDXDBOOK; all++) {
        indexToSearch[all] = new IndexSearcher(INDEXEDBOOKS[all]);
        System.out.println(all + " ADDED TO SEARCHABLES " + INDEXEDBOOKS[all]);
    }
    MultiSearcher searcher = new MultiSearcher(indexToSearch);

Question: during the search process, how do I display which merged index a relevant document id originated from? [Something like this: search word 'ISBN12345' is available from MRGx.] WITH WARM REGARDS HAVE A NICE DAY [N.S.KARTHIK] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Synonyms for AND/OR/NOT operators
Hi! What is the simplest way to add synonyms for the AND/OR/NOT operators? I'd like to support two sets of operator words, so people can use either the original English operators or my custom ones for our local language. Thank you for your attention! Sanyi
Re: MergerIndex + Searchables
As obvious as it may seem, you could always store the ID of the index in which you are indexing the document in the document itself and have that fetched with the search results. Or is there something stopping you from doing that? Nader Henein

Karthik N S wrote: Hi Guys Apologies... I have several merged indexes [MGR1, MGR2, MGR3]. [...] Question: during the search process, how do I display which merged index a relevant document id originated from?
Re: index size doubled?
On Tuesday 21 December 2004 05:49, aurora wrote: I'm testing the rebuilding of the index. I add several hundred documents, optimize, add another few hundred, and so on. Right now I have around 7000 files. I observed this after the index gets to a certain size. Every time after optimize, there are two files of roughly the same size, like below:

    12/20/2004 01:57p          13 deletable
    12/20/2004 01:57p          29 segments
    12/20/2004 01:53p  14,460,367 _5qf.cfs
    12/20/2004 01:57p  15,069,013 _5zr.cfs

The total index size is double what I expect. This is not always reproducible. (I'm constantly tuning my program and the set of documents.) Sometimes I get a single file after optimize. What was happening?

Lucene tried to delete the older segment (_5qf.cfs above), but got an error back from the file system. After that it put the name of that segment in the deletable file, so it can try later to delete that segment. This is known behaviour on FAT file systems, which randomly take some time to finish closing a file after it has been correctly closed by a program. Regards, Paul Elschot
Re: MergerIndex + Searchables
Karthik, On Tuesday 21 December 2004 09:04, Karthik N S wrote: Hi Guys Apologies... I have several merged indexes [MGR1, MGR2, MGR3]. [...] Question: during the search process, how do I display which merged index a relevant document id originated from? [Something like this: search word 'ISBN12345' is available from MRGx.]

I think you are looking for the methods subSearcher() and subDoc() on MultiSearcher. Regards, Paul Elschot
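Paul's suggestion can be sketched without Lucene on the classpath: MultiSearcher numbers documents by concatenating its sub-indexes and keeps an array of starting doc ids, and subSearcher()/subDoc() just invert that offset arithmetic. The class name and the index sizes below are made up for illustration; the method names mirror MultiSearcher's.

```java
// Self-contained sketch of how MultiSearcher maps a merged doc id back to
// the sub-index it came from. "starts" holds the cumulative doc counts of
// the sub-indexes (here: 10, 5, and 20 documents).
public class SubSearcherDemo {

    // index of the sub-searcher whose range contains docId
    static int subSearcher(int[] starts, int docId) {
        int i = 0;
        while (i + 1 < starts.length && starts[i + 1] <= docId) {
            i++; // advance while the next sub-index still starts at or before docId
        }
        return i;
    }

    // doc id within that sub-index
    static int subDoc(int[] starts, int docId) {
        return docId - starts[subSearcher(starts, docId)];
    }

    public static void main(String[] args) {
        int[] starts = {0, 10, 15};
        // merged doc id 12 falls in the second sub-index (MGR2), local id 2
        System.out.println("MGR" + (subSearcher(starts, 12) + 1)
            + ", local doc id " + subDoc(starts, 12));
    }
}
```

In real code you would call searcher.subSearcher(hits.id(i)) and searcher.subDoc(hits.id(i)) on the MultiSearcher itself, then map the returned sub-searcher index back to your MGR names.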
Re: Synonyms for AND/OR/NOT operators
On Dec 21, 2004, at 3:04 AM, Sanyi wrote: What is the simplest way to add synonyms for the AND/OR/NOT operators? I'd like to support two sets of operator words, so people can use either the original English operators or my custom ones for our local language.

There are two options that I know of: 1) add synonyms during indexing and 2) add synonyms during querying. Generally this would be done using a custom analyzer. If the synonym mappings are static and you don't mind a larger index, adding them during indexing avoids the complexity of rewriting the query. Injecting synonyms during querying allows the synonym mappings to change dynamically, though it does produce more complex queries. Here's an example you'll find with the source code distribution of Lucene in Action which uses WordNet to look up synonyms.

Erik

p.s. I'm sensitive to over-marketing Lucene in Action in this forum, as it would bother me to constantly see an advertisement. You can be sure that any mentions of it from me will coincide with concrete examples (which are freely available) that are directly related to the questions being asked.

    % ant -emacs SynonymAnalyzerViewer
    Buildfile: build.xml
    check-environment:
    compile:
    build-test-index:
    build-perf-index:
    prepare:
    SynonymAnalyzerViewer:

Using a custom SynonymAnalyzer, two fixed strings are analyzed with the results displayed. Synonyms, from the WordNet database, are injected into the same positions as the original words. See the Analysis chapter for more on synonym injection and position increments. The Tools and extensions chapter covers the WordNet feature found in the Lucene sandbox.

    Press return to continue...
    Running lia.analysis.synonym.SynonymAnalyzerViewer...
    1: [quick] [warm] [straightaway] [spry] [speedy] [ready] [quickly] [promptly] [prompt] [nimble] [immediate] [flying] [fast] [agile]
    2: [brown] [brownness] [brownish]
    3: [fox] [trick] [throw] [slyboots] [fuddle] [fob] [dodger] [discombobulate] [confuse] [confound] [befuddle] [bedevil]
    4: [jumps]
    5: [over] [o] [across]
    6: [lazy] [faineant] [indolent] [otiose] [slothful]
    7: [dogs]

    1: [oh]
    2: [we]
    3: [get] [acquire] [aim] [amaze] [arrest] [arrive] [baffle] [beat] [become] [beget] [begin] [bewilder] [bring] [can] [capture] [catch] [cause] [come] [commence] [contract] [convey] [develop] [draw] [drive] [dumbfound] [engender] [experience] [father] [fetch] [find] [fix] [flummox] [generate] [go] [gravel] [grow] [have] [incur] [induce] [let] [make] [may] [mother] [mystify] [nonplus] [obtain] [perplex] [produce] [puzzle] [receive] [scram] [sire] [start] [stimulate] [stupefy] [stupify] [suffer] [sustain] [take] [trounce] [undergo]
    4: [both]
    5: [kinds]
    6: [country] [state] [nationality] [nation] [land] [commonwealth] [area]
    7: [western] [westerly]
    8: [bb]

    BUILD SUCCESSFUL
    Total time: 10 seconds
Re: Synonyms for AND/OR/NOT operators
Hi! I think we're talking about different things. My question is about synonyms for the AND/OR/NOT operators, not about synonyms of words in the index. For example, in some language: AND = AANNDD; OR = OORR; NOT = NNOOTT. So, the user can enter:

    (cat OR kitty) AND black AND tail

or:

    (cat OORR kitty) AANNDD black AANNDD tail

Both sets of operators must work. It must be some kind of query parser modification/parameterization, so it has nothing to do with the index. I hope I was more specific now ;) Thanx, Sanyi

--- Erik Hatcher [EMAIL PROTECTED] wrote: There are two options that I know of: 1) add synonyms during indexing and 2) add synonyms during querying. Generally this would be done using a custom analyzer. [rest of quoted message snipped]
Re: Synonyms for AND/OR/NOT operators
Erik Hatcher writes: There are two options that I know of: 1) add synonyms during indexing and 2) add synonyms during querying. Generally this would be done using a custom analyzer. [...]

I guess you misunderstood the question. I think he wants to know how to create a query parser that understands something like 'a UND b' as well as 'a AND b', to support localized operator names (German in this case). AFAIK that can only be done by copying the query parser's javacc source and adding the operators there. It shouldn't be difficult, though it's a bit ugly since it implies code duplication. And there will be no way of choosing the operators dynamically at runtime; one will need different query parsers for different languages. Morus
Re: Synonyms for AND/OR/NOT operators
Wow, I really did misunderstand. My apologies. Yes, you will need to fork QueryParser.jj and install JavaCC to build your custom parser. It should be pretty trivial to add alternatives to AND(+)/OR/NOT(-). Erik

On Dec 21, 2004, at 4:42 AM, Sanyi wrote: Hi! I think we're talking about different things. My question is about synonyms for the AND/OR/NOT operators, not about synonyms of words in the index. [rest of quoted message snipped]
Lucene index files from two different applications.
Hi! I have two applications. Both are supposed to write Lucene index files, and a WebApplication is supposed to read these index files. Here are the questions: 1. Can two applications write index files, in the same directory, at the same time? 2. If two applications cannot write index files, in the same directory, at the same time, how should we resolve this? Would appreciate any solutions to this... 3. My thought is to write the index files in two different directories and read both the indexes (as though they form a single index; search results should consider the documents in both indexes) from the WebApplication. How to go about implementing this, using the Lucene API? Need inputs on which of the Lucene APIs to use. Thanks, Gururaja
Re: Lucene index files from two different applications.
Gururaja H wrote: 1. Can two applications write index files, in the same directory, at the same time?

If you implement the synchronisation between these two applications, yes.

2. If two applications cannot write index files, in the same directory, at the same time, how should we resolve this?

See 1. and 3.

3. My thought is to write the index files in two different directories and read both the indexes (as though they form a single index) from the WebApplication. How to go about implementing this, using the Lucene API?

If your requirements allow you to create two independent indices, then you can use the MultiSearcher to search in both indices. Maybe this will be the most cost-effective solution in your case. Best, Sergiu
Re: Lucene index files from two different applications.
On Dec 21, 2004, at 5:51 AM, Gururaja H wrote: 1. Can two applications write index files, in the same directory, at the same time ? If you mean to the same Lucene index, the answer is no. Only a single IndexWriter instance may be writing to an index at one time. 2. If two applications cannot write index files, in the same directory, at the same time. How should we resolve this ? Would appriciate any solutions to this... You may consider writing a queuing system so that two applications queue up a document to index, and a single indexer application reads from the queue. Or the applications could wait until the index is available for writing. Or... 3. My thought is to write the index files in two different directories and read both the indexes (as though it forms a single index, search results should consider the documents in both the indexes) from the WebApplication. How to go about implementing this, using Lucene API ? Need inputs on which of the Lucene API's to use ? Lucene can easily search from multiple indexes using MultiSearcher. This merges the results together as you'd expect. Erik
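Erik's queuing suggestion can be sketched in plain Java: any number of producers hand documents to a BlockingQueue, and a single consumer thread is the only writer. Here "indexing" is stubbed out as appending to a list; in real code the consumer would hold the one IndexWriter. The class name and the poison-pill shutdown marker are illustrative choices, not anything from Lucene.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Many producers, one indexer thread: only the consumer ever "writes".
public class IndexQueue {
    private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(100);
    private final List<String> indexed = new ArrayList<>(); // stand-in for the index
    static final String POISON = "__STOP__";                // shutdown marker

    // called by any application that wants a document indexed
    void submit(String doc) throws InterruptedException {
        queue.put(doc);
    }

    // the single writer: drains the queue until it sees the poison pill
    void runIndexer() throws InterruptedException {
        for (String doc = queue.take(); !doc.equals(POISON); doc = queue.take()) {
            indexed.add(doc); // real code: writer.addDocument(...)
        }
    }

    List<String> indexedDocs() { return indexed; }

    public static void main(String[] args) throws Exception {
        IndexQueue iq = new IndexQueue();
        Thread indexer = new Thread(() -> {
            try { iq.runIndexer(); } catch (InterruptedException ignored) { }
        });
        indexer.start();
        iq.submit("doc from app A");
        iq.submit("doc from app B");
        iq.submit(POISON);
        indexer.join();
        System.out.println(iq.indexedDocs().size() + " docs indexed by one writer");
    }
}
```

Between separate processes the queue would be a table, a directory of files, or a message broker rather than an in-memory structure, but the single-writer discipline is the same.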
Re: Synonyms for AND/OR/NOT operators
Well, I guess I'd better recognize and replace the operator synonyms with their original forms before passing them to QueryParser. I don't feel comfortable tampering with Lucene's source code. Anyway, thanx for the answers. Sanyi

--- Morus Walter [EMAIL PROTECTED] wrote: I guess you misunderstood the question. I think he wants to know how to create a query parser that understands something like 'a UND b' as well as 'a AND b', to support localized operator names. AFAIK that can only be done by copying the query parser's javacc source and adding the operators there. [rest of quoted message snipped]
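Sanyi's pre-processing idea can be sketched as a simple token rewrite that leaves anything inside double quotes alone, so phrase queries survive (Morus's warning below concerns exactly that case). The localized operators AANNDD/OORR/NNOOTT are the hypothetical ones from this thread; the class and method names are made up.

```java
import java.util.HashMap;
import java.util.Map;

// Translate localized operator words to QueryParser's built-in AND/OR/NOT
// before handing the string to QueryParser. Tokens inside double quotes are
// passed through untouched.
public class OperatorSynonyms {
    private static final Map<String, String> OPS = new HashMap<>();
    static {
        OPS.put("AANNDD", "AND");
        OPS.put("OORR", "OR");
        OPS.put("NNOOTT", "NOT");
    }

    static String normalize(String query) {
        StringBuilder out = new StringBuilder();
        boolean inPhrase = false;
        for (String tok : query.split(" ")) {
            if (out.length() > 0) out.append(' ');
            if (!inPhrase && OPS.containsKey(tok)) {
                out.append(OPS.get(tok));   // rewrite the operator
            } else {
                out.append(tok);            // ordinary term, or inside a phrase
            }
            // crude phrase tracking: toggle on an odd number of quotes in a token
            int quotes = tok.length() - tok.replace("\"", "").length();
            if (quotes % 2 == 1) inPhrase = !inPhrase;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("(cat OORR kitty) AANNDD black AANNDD tail"));
    }
}
```

This only handles space-separated tokens and unescaped quotes; forking QueryParser.jj, as discussed below, is the more robust route.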
Re: Synonyms for AND/OR/NOT operators
Sanyi writes: Well, I guess I'd better recognize and replace the operator synonyms with their original forms before passing them to QueryParser. I don't feel comfortable tampering with Lucene's source code.

Apart from knowing how to compile Lucene (including the javacc code generation) you should only need to change

    <DEFAULT> TOKEN : {
      <AND: ("AND" | "&&")>
    | <OR:  ("OR" | "||")>
    | <NOT: ("NOT" | "!")>

to

    <DEFAULT> TOKEN : {
      <AND: ("AND" | insert your version of and here | "&&")>
    | <OR:  ("OR" | insert your version of or here | "||")>
    | <NOT: ("NOT" | insert your version of not here | "!")>

in jakarta-lucene/src/java/org/apache/lucene/queryParser/QueryParser.jj. Replacing the operators before parsing might be hard to do if you want to handle cases like »"a AND b" OR c«, which is a query for the phrase "a AND b" or the token c, correctly. Morus
Re: index size doubled?
Another possibility is that you are using an older version of Lucene, which was known to have a bug with similar symptoms. Get the latest version of Lucene. You shouldn't really have multiple .cfs files after optimizing your index. Also, optimize only at the end, if you care about indexing speed. Otis

--- Paul Elschot [EMAIL PROTECTED] wrote: Lucene tried to delete the older segment, but got an error back from the file system. After that it put the name of that segment in the deletable file, so it can try later to delete that segment. This is known behaviour on FAT file systems. [rest of quoted message snipped]
sorting on a field that can have null values (resend)
I sent this mail yesterday but had no luck in receiving responses. Trying it again.

Hi all, I am getting a NullPointerException when I sort on a field that has a null value for some documents. ORDER BY in SQL does work on such fields, and I think it puts all results with null values at the end of the list. Shouldn't Lucene also do the same thing instead of throwing a NullPointerException? Is this expected behaviour? Is Lucene always expecting some value in the sortable fields? I thought of putting empty strings instead of null values, but I think empty strings are put first in the list while sorting, which is the reverse of what anyone would want. Following is the exception I saw in the error log:

    java.lang.NullPointerException
      at org.apache.lucene.search.SortComparator$1.compare(Lorg.apache.lucene.search.ScoreDoc;Lorg.apache.lucene.search.ScoreDoc;)I(SortComparator.java:36)
      at org.apache.lucene.search.FieldSortedHitQueue.lessThan(Ljava.lang.Object;Ljava.lang.Object;)Z(FieldSortedHitQueue.java:95)
      at org.apache.lucene.util.PriorityQueue.upHeap()V(PriorityQueue.java:120)
      at org.apache.lucene.util.PriorityQueue.put(Ljava.lang.Object;)V(PriorityQueue.java:47)
      at org.apache.lucene.util.PriorityQueue.insert(Ljava.lang.Object;)Z(PriorityQueue.java:58)
      at org.apache.lucene.search.IndexSearcher$2.collect(IF)V(IndexSearcher.java:130)
      at org.apache.lucene.search.Scorer.score(Lorg.apache.lucene.search.HitCollector;)V(Scorer.java:38)
      at org.apache.lucene.search.IndexSearcher.search(Lorg.apache.lucene.search.Query;Lorg.apache.lucene.search.Filter;ILorg.apache.lucene.search.Sort;)Lorg.apache.lucene.search.TopFieldDocs;(IndexSearcher.java:125)
      at org.apache.lucene.search.Hits.getMoreDocs(I)V(Hits.java:64)
      at org.apache.lucene.search.Hits.init(Lorg.apache.lucene.search.Searcher;Lorg.apache.lucene.search.Query;Lorg.apache.lucene.search.Filter;Lorg.apache.lucene.search.Sort;)V(Hits.java:51)
      at org.apache.lucene.search.Searcher.search(Lorg.apache.lucene.search.Query;Lorg.apache.lucene.search.Sort;)Lorg.apache.lucene.search.Hits;(Searcher.java:41)

If it's a bug in Lucene, will it be fixed in the next release? Any suggestions would be appreciated. Praveen

**
Praveen Peddi
Sr Software Engg, Context Media, Inc.
email: [EMAIL PROTECTED]
Tel: 401.854.3475 Fax: 401.861.3596
web: http://www.contextmedia.com
**
Context Media - The Leader in Enterprise Content Integration
Lucene working with a DB
I have read a lot of messages saying that Lucene can index a DB because it uses the InputStream type, but I don't understand how to do this. For example, if I have a forum with MySQL and a lot of files on my web site, for every search I have to select the index that I want to use in my search, true? But I don't know how to make Lucene write an index from the information in the forum's DB (for example MySQL).
Stopwords in phrases
I want to be able to use stopwords in exact phrase searches. I have looked at Nutch and used the same approach (replace common words with n-grams; look at net.nutch.analysis.CommonGrams). So if to, be, or and not are stop words, for the string "to be or not to be" the analyzer produces the following tokens: [to-be, to-be-or, to-be-or-not, to-be-or-not-to, to-be-or-not-to-be, be-or, be-or-not, be-or-not-to, be-or-not-to-be, or-not, or-not-to, or-not-to-be, not-to, not-to-be, to-be]. This is exactly what I wanted from the analyzer during indexing. But I'm having a problem with the search. When I do a search on "not to be", the analyzer is converting my search into content:"not-to not-to-be to-be" because the analyzer produces the tokens not-to, not-to-be, to-be. I'm getting 0 results on this, as there is no token "not-to not-to-be to-be" in the index. I want just not-to-be from the analyzer during the search, so when I search on "not to be" I will get the document which has not-to-be as a token. How can I use the same analyzer to get different results in indexing and searching? Thanks in advance, Ravi.
Re: Lucene working with a DB
On Dec 21, 2004, at 10:39 AM, Daniel Cortes wrote: I read a lot of messages that Lucene can index a DB because it uses the InputStream type

Where have you read that? This is incorrect.

[...]

To index data in a database into a Lucene index, you must write code that pulls the records from the database and adds them to a Lucene index, slicing them into fields in whatever manner you need. You will want to be sure to update the index when your database changes, by either removing or updating (remove and re-add) documents. There is nothing built-in that will do these steps for you. Erik
Re: Stopwords in phrases
On Dec 21, 2004, at 10:41 AM, Ravi wrote: I want to be able to use stopwords in exact phrase searches. I have looked at Nutch and used the same approach (replace common words with n-grams; look at net.nutch.analysis.CommonGrams). [...]

You've gone a bit beyond what Nutch is using. It creates bigrams, where you've expanded it to many more than that. Are you also using a position increment of 0 for the gram tokens, like Nutch does?

But I'm having a problem with the search. [...] How can I use the same analyzer to get different results in indexing and searching?

Nutch does some different stuff between indexing and parsing queries...

    [java] 1: [the:WORD] [the-quick:gram]
    [java] 2: [quick:WORD]
    [java] 3: [brown:WORD]
    [java] 4: [fox:WORD]
    [java] query = (+url:"the quick brown"^4.0) (+anchor:"the quick brown"^2.0) (+content:"the-quick quick brown")

The first four lines show the analysis of "the quick brown fox". The last line is the resultant Lucene query for "the quick brown". Notice that only the content field gets analyzed specially, and also that only gram tokens are considered in that field, not the WORD tokens, if there is also a gram. Does this help with your situation? Erik
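The bigram scheme Erik describes can be sketched without the Nutch classes: whenever a token pair involves a stop word, emit the joined gram, and at query time keep only the grams (position-increment bookkeeping, which the real CommonGrams also handles, is omitted here). The class name is made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Rough sketch of the "common grams" bigram idea: join any adjacent token
// pair in which either member is a stop word.
public class CommonGramsSketch {
    static List<String> grams(String[] tokens, Set<String> stopWords) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.length; i++) {
            if (stopWords.contains(tokens[i]) || stopWords.contains(tokens[i + 1])) {
                out.add(tokens[i] + "-" + tokens[i + 1]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("to", "be", "or", "not"));
        // "not to be" yields the bigrams not-to and to-be; a query-time phrase
        // over those grams matches an index built the same way.
        System.out.println(grams("not to be".split(" "), stop));
    }
}
```

Note this produces a phrase of bigrams (not-to to-be) rather than the single not-to-be token Ravi asked for; matching Nutch's bigram output on both the index and query sides is what makes the phrase search line up.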
RE: Lucene index files from two different applications.
Depending on what you are doing, there are some problems with MultiSearcher. See http://issues.apache.org/bugzilla/show_bug.cgi?id=31841 for a description of the issues and possible patch(es) to fix them. Chuck

-----Original Message-----
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Tuesday, December 21, 2004 3:09 AM
To: Lucene Users List
Subject: Re: Lucene index files from two different applications.

Lucene can easily search from multiple indexes using MultiSearcher. This merges the results together as you'd expect. [rest of quoted message snipped]
Re: Lucene working with a DB
Hello,

I'll just paste the relevant MySQL code; you add the calls to it per your needs. It has no checking of anything, so better add that as well... It's possible I didn't copy/paste everything, but you should get the idea of where this is going...

-pedja

--
import java.sql.*;
// import lucene stuff...

public class sqlTest {

    public static void main(String[] args) throws Exception {
        String sTable = args[0];
        String sThing = args[1];
        String indexDir = "/path/to/lucene/index";
        try {
            Analyzer analyzer = new StandardAnalyzer();
            IndexWriter fsWriter = new IndexWriter(indexDir, analyzer, false);
            addSQLDoc(fsWriter, sTable, sThing);
            fsWriter.close();
        } catch (Exception e) {
            throw new Exception("caught a " + e.getClass()
                + "\n with message: " + e.getMessage());
        }
    }

    private static void addSQLDoc(IndexWriter writer, String sqlTable,
            String somethingElse) throws Exception {
        String cs = "jdbc:mysql://HOST/DATABASE?user=SQLUSER&password=SQLPASSWORD";
        String sql = "SELECT * FROM " + sqlTable
            + " WHERE something=\"" + somethingElse + "\"";

        // establish a connection to the MySQL database
        try {
            Class.forName("com.mysql.jdbc.Driver").newInstance();
        } catch (Exception e) {
            System.out.println("Lucene: ERROR: Unable to load driver");
            e.printStackTrace();
        }

        // get the record data...
        try {
            Connection conn = DriverManager.getConnection(cs);
            Statement stmt = conn.createStatement();
            ResultSet rs = stmt.executeQuery(sql);
            while (rs.next()) {
                // make a new, empty document
                Document doc = new Document();
                // get the database fields
                String field1 = rs.getString(1);
                String field2 = rs.getString(2);
                String field3 = rs.getString(3);
                String field4 = rs.getString(4);
                String field5 = rs.getString(5);
                // add the fields to the document
                // doc.add(Field.Keyword("FIELD1", field1));
                doc.add(Field.Keyword("FIELD2", field2));
                doc.add(Field.Keyword("FIELD3", field3));
                doc.add(Field.Keyword("FIELD4", field4));
                doc.add(Field.Text("FIELD5", field5));
                // add the document
                writer.addDocument(doc);
            }
            rs.close();
            stmt.close();
            conn.close();
        } catch (SQLException e) {
            e.printStackTrace();
            throw new Exception();
        }
    }
}
--

Daniel Cortes said the following on 12/21/2004 10:39 AM:

I read in a lot of messages that Lucene can index a DB because it uses the INPUTSTREAM type, but I don't understand how to do this. For example, if I have a forum with MySQL and a lot of files on my web site, for every search I have to select the index that I want to use in my search, true? But I don't know how to make Lucene write an index from the information in the forum's DB (for example MySQL).
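One caveat about the snippet above: it builds its SQL by concatenating the caller's arguments directly into the query string, which is open to SQL injection. A minimal defensive sketch (the table names here are made up for illustration) is to check the table name against a known whitelist, since table names cannot go through a PreparedStatement placeholder, while the WHERE value should use a `?` placeholder.

```java
import java.util.*;

class TableWhitelist {
    // Only tables known at compile time may be queried; the names
    // "forum_posts" and "forum_topics" are hypothetical examples.
    private static final Set<String> ALLOWED =
        new HashSet<>(Arrays.asList("forum_posts", "forum_topics"));

    // Reject any table name not known in advance, so user input can
    // never alter the SQL statement's structure.
    static String checkedTable(String name) {
        if (!ALLOWED.contains(name)) {
            throw new IllegalArgumentException("unknown table: " + name);
        }
        return name;
    }
}
```

The WHERE value itself would then be bound with `conn.prepareStatement("SELECT * FROM " + checkedTable(sqlTable) + " WHERE something = ?")` followed by `setString(1, somethingElse)`.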
Re: index size doubled?
Thanks for the heads up. I'm using Lucene 1.4.2. I tried to do optimize() again, but it has no effect. Adding just a tiny dummy document gets rid of it. I'm doing an optimize every few hundred documents because I'm trying to simulate incremental updates. This leads to another question I will post separately. Thanks.

Another possibility is that you are using an older version of Lucene, which was known to have a bug with similar symptoms. Get the latest version of Lucene. You shouldn't really have multiple .cfs files after optimizing your index. Also, optimize only at the end, if you care about indexing speed.

Otis

--- Paul Elschot [EMAIL PROTECTED] wrote:

On Tuesday 21 December 2004 05:49, aurora wrote:

I'm testing the rebuilding of the index. I add several hundred documents, optimize, add another few hundred, and so on. Right now I have around 7000 files. I observed that after the index gets to a certain size, every time after an optimize there are two files of roughly the same size, like below:

12/20/2004 01:57p            13 deletable
12/20/2004 01:57p            29 segments
12/20/2004 01:53p    14,460,367 _5qf.cfs
12/20/2004 01:57p    15,069,013 _5zr.cfs

The total index is double what I expect. This is not always reproducible (I'm constantly tuning my program and the set of documents). Sometimes I do get a single file after the optimize. What is happening?

Lucene tried to delete the older version (_5qf.cfs above), but got an error back from the file system. After that it put the name of that segment in the deletable file, so it can try later to delete that segment. This is known behaviour on FAT file systems. These randomly take some time for themselves to finish closing a file after it has been correctly closed by a program.

Regards, Paul Elschot

--
Using Opera's revolutionary e-mail client: http://www.opera.com/m2/
how often to optimize?
Right now I am incrementally adding about 100 documents to the index a day and then optimizing after that. I find that optimize essentially rebuilds the entire index into a single file, so the amount of disk writing is proportional to the total index size, not to the size of the documents incrementally added. So my question is: would it be overkill to optimize every day? Is there any guideline on how often to optimize? Every 1000 documents or more? Every week? Is there any concern if a lot of documents are added without optimizing? Thanks.
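If an every-N-documents policy turns out to fit, the bookkeeping is trivial to factor out. This is a hedged sketch (the class name and the interval of 3 used in the demo are arbitrary, and the actual `writer.optimize()` call is left as a comment): count adds, and signal when the threshold is reached.

```java
class OptimizePolicy {
    private final int interval;     // optimize after this many added docs
    private int sinceOptimize = 0;  // adds since the last optimize

    OptimizePolicy(int interval) {
        this.interval = interval;
    }

    // Call once per addDocument(); returns true when it's time to optimize.
    boolean recordAddAndCheck() {
        sinceOptimize++;
        if (sinceOptimize >= interval) {
            sinceOptimize = 0;  // caller would invoke writer.optimize() here
            return true;
        }
        return false;
    }
}
```

As Otis notes below in this thread, the right interval depends on whether an unoptimized index is actually causing problems; this just keeps the counting logic out of the indexing loop.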
RE: Stopwords in phrases
Are you also using the position increment of 0 for the gram tokens like Nutch does?

Yes. I don't think considering only gram tokens will work for me, because Nutch uses only bi-grams: it can have only one gram per token. In my case I have more than one, and even if I take only the grams, I still have the same problem.

Ravi.
Re: how often to optimize?
Hello,

I think some of these questions may be answered in the jGuru FAQ.

So my question is would it be an overkill to optimize everyday?

Only if lots of documents are being added/deleted and you end up with a lot of index segments.

Is there any guideline on how often to optimize? Every 1000 documents or more?

Are the unoptimized indices causing you any problems (e.g. slow searches, a high number of open file handles)? If not, then you don't even need to optimize until those issues become... issues.

Every week? Is there any concern if there are a lot of documents added without optimizing?

Possibly; see my answer above.

Otis
[ANNOUNCE] dotLucene 1.4.3 RC1 (port of Jakarta Lucene to C#)
Hi Folks,

I am pleased to announce the availability of dotLucene 1.4.3 RC1 build-001. This is the first Release Candidate of version 1.4.3 of Jakarta Lucene ported to C#, and it is intended to become Final. Please visit http://www.sourceforge.net/projects/dotlucene/ to learn more about dotLucene and to download the source code.

Best regards,

-- George Aroush