Re: WildcardQuery and SpanQuery
Thanks for the quick response Paul =) However I am lost while looking at the surround package. Are you suggesting I can solve my problem at hand using the surround package?

On 7/18/07, Paul Elschot <[EMAIL PROTECTED]> wrote:
> On Wednesday 18 July 2007 05:58, Cedric Ho wrote:
> > Hi everybody,
> >
> > We recently need to support the wildcard search terms "*" and "?"
> > together with SpanQuery. It seems that there's no SpanWildcardQuery
> > available. After looking into the Lucene source code for a while, I
> > guess we can either:
> >
> > 1. Use SpanRegexQuery, or
> >
> > 2. Write our own SpanWildcardQuery, and implement the
> > rewrite(IndexReader) method to rewrite the query into a SpanOrQuery
> > over some SpanTermQuery instances.
> >
> > Of the two approaches, option 1 seems to be easier. But I am rather
> > concerned about the performance of using regular expressions. On the
> > other hand, I am not sure if there are any other concerns I am not
> > aware of for option 2 (i.e. is there a reason why there's no
> > SpanWildcardQuery in the first place?)
> >
> > Any advice?
>
> The basic problem you are facing is that in Lucene the expansion of the
> terms is tightly coupled to the generation of a combination query using
> the expanded terms.
>
> In contrib/surround the term expansion and query generation are
> decoupled using a visitor pattern for the terms. The code is here:
> http://svn.apache.org/viewvc/lucene/java/trunk/contrib/surround/src/java/org/apache/lucene/queryParser/surround/query
>
> In surround a wildcard term can provide either an OR of normal term
> queries, or a SpanOrQuery of span term queries. This query generation
> is in class SimpleTerm, which has one method for a normal boolean OR
> query over the terms, and one for a span query over the terms.
>
> In both cases surround uses a regular expression to expand the matching
> terms, but that could be changed to use other wildcard expansion
> mechanisms than the ones in SrndPrefixQuery and SrndTruncQuery, which
> are subclasses of SimpleTerm.
>
> With the term expansion and the query combination split, it is also
> necessary to limit the maximum number of expanded terms in another way
> than Lucene does. In surround the classes BasicQueryFactory and
> TooManyBasicQueries are used for that.
>
> Regards,
> Paul Elschot

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
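The decoupling Paul describes can be sketched in plain Java: expand a wildcard pattern against a term dictionary with a regular expression (roughly what SrndPrefixQuery and SrndTruncQuery do), with a hard cap standing in for the BasicQueryFactory/TooManyBasicQueries limit. This is an illustrative sketch under made-up names, not the actual surround code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class WildcardExpansion {

    // Translate Lucene-style wildcards ('*' matches any run of characters,
    // '?' matches exactly one character) into an equivalent regex.
    static Pattern toPattern(String wildcard) {
        StringBuilder sb = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') sb.append(".*");
            else if (c == '?') sb.append('.');
            else sb.append(Pattern.quote(String.valueOf(c)));
        }
        return Pattern.compile(sb.toString());
    }

    // Expand a wildcard term against a term dictionary, failing fast once
    // more than maxTerms terms match (the TooManyBasicQueries role).
    static List<String> expand(String wildcard, List<String> dictionary, int maxTerms) {
        Pattern p = toPattern(wildcard);
        List<String> matches = new ArrayList<String>();
        for (String term : dictionary) {
            if (p.matcher(term).matches()) {
                if (matches.size() == maxTerms)
                    throw new IllegalStateException("too many expanded terms");
                matches.add(term);
            }
        }
        return matches;
    }
}
```

Once the matching terms are in hand, they can be handed to either query shape: a boolean OR of TermQuerys or a SpanOrQuery of SpanTermQuerys, which is exactly the split SimpleTerm provides.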
Re: WildcardQuery and SpanQuery
You could give this a shot (from my Qsol query parser):

package com.mhs.qsol.spans;

/**
 * Copyright 2006 Mark Miller ([EMAIL PROTECTED])
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Set;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.WildcardQuery;
import org.apache.lucene.search.spans.SpanOrQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.search.spans.Spans;

/**
 * @author mark miller
 */
public class SpanWildcardQuery extends SpanQuery {

  private Term term;

  public SpanWildcardQuery(Term term) {
    this.term = term;
  }

  public Term getTerm() {
    return term;
  }

  public Query rewrite(IndexReader reader) throws IOException {
    WildcardQuery wildQuery = new WildcardQuery(term);
    // WildcardQuery rewrites to a BooleanQuery over the matching terms;
    // turn each TermQuery clause into a SpanTermQuery.
    BooleanQuery bq = (BooleanQuery) wildQuery.rewrite(reader);
    BooleanClause[] clauses = bq.getClauses();
    SpanQuery[] sqs = new SpanQuery[clauses.length];
    for (int i = 0; i < clauses.length; i++) {
      TermQuery tq = (TermQuery) clauses[i].getQuery();
      sqs[i] = new SpanTermQuery(tq.getTerm());
      sqs[i].setBoost(tq.getBoost());
    }
    SpanOrQuery query = new SpanOrQuery(sqs);
    query.setBoost(wildQuery.getBoost());
    return query;
  }

  public Spans getSpans(IndexReader reader) throws IOException {
    throw new UnsupportedOperationException("Query should have been rewritten");
  }

  public String getField() {
    return term.field();
  }

  /**
   * @deprecated use extractTerms instead
   * @see #extractTerms(Set)
   */
  public Collection getTerms() {
    Collection terms = new ArrayList();
    terms.add(term);
    return terms;
  }

  public void extractTerms(Set terms) {
    terms.add(term);
  }

  public String toString(String field) {
    StringBuffer buffer = new StringBuffer();
    buffer.append("spanWildcardQuery(");
    buffer.append(term);
    buffer.append(")");
    // buffer.append(ToStringUtils.boost(getBoost()));
    return buffer.toString();
  }
}
Query in lucene
Which analyser do I have to use to find text like this ''?
Re: Query in lucene
When in doubt, WhitespaceAnalyzer is the most predictable. Note that it doesn't lower-case the tokens, though. Depending upon your requirements, you can always pre-process your query and indexing streams and do your own lowercasing and/or character stripping. You can also create your own analyzer from the building blocks provided by the various Filters and Tokenizers.

Erick

On 7/18/07, WATHELET Thomas <[EMAIL PROTECTED]> wrote:
> Which analyser do I have to use to find text like this ''?
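For reference, what WhitespaceAnalyzer plus a lower-casing step amounts to can be sketched in a few lines of plain Java. This is an illustrative stand-in, not Lucene's actual tokenizer chain:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SimpleTokens {

    // Roughly what WhitespaceAnalyzer followed by a lower-casing filter
    // produces: split on runs of whitespace, then lowercase each token.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String t : text.trim().split("\\s+")) {
            if (t.length() > 0) tokens.add(t.toLowerCase(Locale.ENGLISH));
        }
        return tokens;
    }
}
```

The key point: whatever normalization you apply at index time must be applied identically to the query, or the terms will not match.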
Re: Does Index have a Tokenizer Built into it
Is there a way to know how big to make the array beforehand (how many terms are in the topic total)? I'm worried about the efficiency of this, since I'd have to rebuild every document that is a "hit" on the fly to make a snippet for each "hit" on the page (say 10 a page). Now I have to wonder how storing the term position vectors in the index and sorting them by position compares to storing the location of the document and using a tokenizer on the document. Both in the end give me the result I want. Any opinions?

--JP

On 7/18/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:
> : After indexing I have been able to retrieve the TermPositionVector from
> : the index and it has all of the data, but I cannot find a way where,
> : given a position, I can retrieve the term at that position. Which is how
> : I was hoping to create my contextual snippets.
>
> There is no easy way to go from a position to a term -- coincidentally
> there is a very recent thread on this on java-dev...
>
> http://www.nabble.com/Best-Practices-for-getting-Strings-from-a-position-range-tf4084187.html
>
> ...a new API may come out of it, but in the meantime you may be interested
> in taking the approach the current highlighter uses (as mentioned in that
> thread), of using the TermPositionVector to rebuild the original token
> stream, then skipping ahead to the positions you are interested in.
>
> -Hoss
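The rebuild-from-positions idea Hoss mentions boils down to inverting the term-to-positions map into position order. A self-contained sketch of just that inversion (the real TermPositionVector API differs; this only mirrors the shape of the data):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class PositionRebuild {

    // Invert a term -> positions map (the shape of what a term vector with
    // positions stores) into the tokens in position order, which is
    // essentially what rebuilding the token stream amounts to.
    static List<String> tokensInOrder(Map<String, int[]> termPositions) {
        TreeMap<Integer, String> byPosition = new TreeMap<Integer, String>();
        for (Map.Entry<String, int[]> e : termPositions.entrySet())
            for (int pos : e.getValue())
                byPosition.put(pos, e.getKey());
        return new ArrayList<String>(byPosition.values());
    }
}
```

With the tokens back in position order, slicing out positions a..b for a snippet is a trivial subList; the sort is the only real cost.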
Re: WildcardQuery and SpanQuery
On Wednesday 18 July 2007 12:30, Cedric Ho wrote:
> Thanks for the quick response Paul =)
>
> However I am lost while looking at the surround package.

That is not really surprising; the code is factored to the bone, and it is hardly documented. You could have a look at the test code to start. The surround.txt file in the contrib/surround directory should also be helpful.

> Are you suggesting I can solve my problem at hand using the surround
> package?

In case the surround syntax fits what you need, you might use the surround package. You could also use your own parser and target the o.a.l.queryParser.surround.query package. The code posted by Mark Miller may solve your problem, too.

Regards,
Paul Elschot
Lucene shows parts of search query as a HIT
Hey folks, I am a new Lucene user. I used the following after indexing:

search(searcher, "W. Chan Kim");

Lucene showed me hits of documents where the word "channel" existed. Notice that "Chan" is a part of "Channel". How do I stop this? I am keen to find the exact word.

I used the following before the search method:

IndexWriter writer = new IndexWriter(indexPath, new StandardAnalyzer(), true);
writer.addDocument(createDocument(item, words));
writer.optimize();
writer.close();
searcher = new IndexSearcher(indexPath);

thanks!

AZ
lucene version?
Is there a way to tell which version of Lucene was used to build an index?

-Akanksha
Re: Lucene shows parts of search query as a HIT
Are you sure that the hit wasn't on "w" or "kim"? The default operator for searching is OR...

I recommend that you get a copy of Luke (google "lucene luke"), which allows you to examine your index as well as see how queries parse using various analyzers. It's an invaluable tool...

Best
Erick

On 7/18/07, Askar Zaidi <[EMAIL PROTECTED]> wrote:
> Hey folks, I am a new Lucene user. I used the following after indexing:
>
> search(searcher, "W. Chan Kim");
>
> Lucene showed me hits of documents where the word "channel" existed.
> How do I stop this? I am keen to find the exact word.
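Erick's point about the default OR operator is easy to reproduce outside Lucene: under OR semantics, a document containing only "kim" is still a hit for the query "W. Chan Kim". A toy sketch of just the boolean logic (not Lucene's parsing or scoring; names are made up):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class BooleanMatch {

    static Set<String> tokens(String text) {
        return new HashSet<String>(Arrays.asList(text.toLowerCase().split("\\W+")));
    }

    // OR semantics (the QueryParser default): any query token present.
    static boolean matchesAny(String doc, String query) {
        Set<String> docTokens = tokens(doc);
        for (String q : tokens(query))
            if (docTokens.contains(q)) return true;
        return false;
    }

    // AND semantics: every query token must be present.
    static boolean matchesAll(String doc, String query) {
        Set<String> docTokens = tokens(doc);
        for (String q : tokens(query))
            if (!docTokens.contains(q)) return false;
        return true;
    }
}
```

Switching the query parser's default operator to AND (or quoting the query as a phrase) is usually what people asking for "exact" matches actually want.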
Re: lucene version?
I don't think this is stored in the index. I think the closest you can get is the "format" of the segments_N file, which changes every time the index file format changes. That at least lets you narrow it down, possibly to a single release if the file format is changing frequently (e.g. it has in the past two releases).

There's no public API to read the format. You could instead make your own class, in package org.apache.lucene.index, that implements a method similar to how the SegmentInfos.readCurrentVersion(...) method is implemented, but just returns the format instead.

Mike

"Akanksha Baid" <[EMAIL PROTECTED]> wrote:
> Is there a way to tell which version of Lucene was used to build an
> index?
>
> -Akanksha
Re: Lucene shows parts of search query as a HIT
Hey guys, I just checked my Lucene results. It shows a document with the word "change" when I am searching for "Chan", and it considers that a hit. Is there a way to stop this and show just the exact word match?

I started using Lucene yesterday, so I am fairly new!

thanks
AZ

On 7/18/07, Erick Erickson <[EMAIL PROTECTED]> wrote:
> Are you sure that the hit wasn't on "w" or "kim"? The default operator
> for searching is OR...
>
> I recommend that you get a copy of Luke (google "lucene luke"), which
> allows you to examine your index as well as see how queries parse using
> various analyzers. It's an invaluable tool...
>
> Best
> Erick
Dictionary Type Lookup
Hi, I am trying to model a dictionary-type search in Lucene. My approach was this:

- Load the dictionary file (words and their meanings) and index each dictionary term and its associated meaning as a Lucene Document.
- Use IndexReader's terms() method to peek at the index and get a TermEnum; TermEnum's next() returns the next term.

The snippet looks like this:

TermEnum browseTermEnum = indexReader.terms(new Term(browseIndex, browsableTerm));
// terms(Term) positions the enum at the first matching term, so read
// term() before advancing with next(), otherwise the first term is skipped.
do {
    Term t = browseTermEnum.term();
    if (t == null) break;
    System.out.println(t.text());
} while (browseTermEnum.next());
browseTermEnum.close();

This works fine, and I can fetch the next 'n' terms. The only problem I see with this route is that I can't get the previous terms!

1. Is there a way to get previous terms from a TermEnum?
2. Is there a better way to model dictionary-type lookup in Lucene?

Appreciate your suggestions.

Thanks
Murali V

--
View this message in context: http://www.nabble.com/Dictionary-Type-Lookup-tf4107251.html#a11679841
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
TermEnum - previous() method ?
Hi all, I searched this forum for anybody asking about the need for a previous() method on TermEnum. I found only this link:

http://www.nabble.com/How-to-navigate-through-indexed-terms-tf28148.html#a189225

Would it be possible to implement a previous() method? I know I am asking for a quick solution here ;) I just want to make sure that, if it is not implemented, there might be a reason, so I can consider alternate approaches to implement a similar feature.

Appreciate your thoughts.

Thanks
Murali V

--
View this message in context: http://www.nabble.com/TermEnumprevious%28%29-method---tf4107296.html#a11679947
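Since TermEnum only walks forward, one workaround for backward paging is to browse a sorted snapshot of the terms in memory. A self-contained sketch using a TreeSet (illustrative only; for a large index you would re-seek the TermEnum or page the snapshot rather than load every term):

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.TreeSet;

class DictionaryBrowser {

    private final TreeSet<String> terms;

    DictionaryBrowser(Collection<String> allTerms) {
        this.terms = new TreeSet<String>(allTerms);
    }

    // Forward paging, which is all TermEnum gives you: n terms at or
    // after 'from', in ascending order.
    List<String> next(String from, int n) {
        List<String> page = new ArrayList<String>();
        for (String t : terms.tailSet(from)) {
            if (page.size() == n) break;
            page.add(t);
        }
        return page;
    }

    // Backward paging: n terms strictly before 'from', nearest first.
    List<String> previous(String from, int n) {
        List<String> page = new ArrayList<String>();
        for (String t : terms.headSet(from, false).descendingSet()) {
            if (page.size() == n) break;
            page.add(t);
        }
        return page;
    }
}
```

The same effect can be had against the index itself by seeking the TermEnum to a term lexicographically before the current page and walking forward again, at the cost of an extra seek per backward page.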
MoreLikeThis
I am using Lucene 2.1.0 and want to use MoreLikeThis for querying documents. I understand that the jar file for it is in contrib. I have the contrib folder extracted, but am not sure what to do from this point on. What jar file am I looking for, and where should I put it? I am using Eclipse. If someone could please point me to some directions for this, that would be a big help. Thanks.
Re: MoreLikeThis
You can put lucene-queries-2.2.0.jar on your classpath or your Eclipse project build path. That's all you need.

Jay

Akanksha Baid wrote:
> I am using Lucene 2.1.0 and want to use MoreLikeThis for querying
> documents. I understand that the jar file for it is in contrib. What jar
> file am I looking for, and where should I put it? I am using Eclipse.
Re: MoreLikeThis
Right, I was making a silly mistake there. I have it working now. Thanks for the reply.

yu wrote:
> You can put lucene-queries-2.2.0.jar on your classpath or your Eclipse
> project build path. That's all you need.
>
> Jay
StandardTokenizer is slowing down highlighting a lot
Hi all, I was tracking down slowness in the contrib highlighter code, and it seems the seemingly simple tokenStream.next() is the culprit. I've seen multiple posts about this being a possible cause. Has anyone looked into how to speed up StandardTokenizer? For my documents it's taking about 70ms per document -- that's a big ugh! I was thinking I might just cache the TermVectors in memory if that will be faster. Anyone have another approach to solving this problem?

-M
Re: StandardTokenizer is slowing down highlighting a lot
Unfortunately, StandardAnalyzer is slow. StandardAnalyzer is really limited by JavaCC speed. You cannot shave much more performance out of the grammar, as it is already about as simple as it gets. You should first see if you can get away without it and use a different Analyzer, or if you can re-implement just the functionality you need in a custom Analyzer. Do you really need the support for abbreviations, companies, email addresses, etc.?

If so: you can use the TokenSources class in the highlighter package to rebuild a TokenStream without re-analyzing, if you store term offsets and positions in the index. I have not found this to be super beneficial, even when using the StandardAnalyzer to re-analyze, but it certainly could be faster if you have large enough documents.

Your best bet is probably to use https://issues.apache.org/jira/browse/LUCENE-644, which is a non-positional Highlighter that finds offsets to highlight by looking up query term offset information in the index. For larger documents this can be much faster than using the standard contrib Highlighter, even if you're using TokenSources. LUCENE-644 has a much flatter curve than the contrib Highlighter as document size goes up.

- Mark

Michael Stoppelman wrote:
> Hi all, I was tracking down slowness in the contrib highlighter code,
> and it seems the seemingly simple tokenStream.next() is the culprit.
> Has anyone looked into how to speed up StandardTokenizer? For my
> documents it's taking about 70ms per document -- that's a big ugh!
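The offset-based idea behind LUCENE-644 (highlight from character offsets stored at index time instead of re-tokenizing the text) can be sketched without Lucene at all. Hypothetical names; in practice the offsets would come from the index's stored offset information for the matched query terms:

```java
import java.util.List;

class OffsetHighlighter {

    static class Hit {
        final int start, end;  // character offsets into the stored text
        Hit(int start, int end) { this.start = start; this.end = end; }
    }

    // Wrap each hit region in <b> tags using precomputed offsets, so the
    // text never has to be re-analyzed. Hits must be in ascending order
    // and non-overlapping.
    static String highlight(String text, List<Hit> hits) {
        StringBuilder out = new StringBuilder();
        int last = 0;
        for (Hit h : hits) {
            out.append(text, last, h.start)
               .append("<b>").append(text, h.start, h.end).append("</b>");
            last = h.end;
        }
        return out.append(text.substring(last)).toString();
    }
}
```

The cost here is a single linear pass over the document text, which is why this approach scales so much more gently with document size than re-running a tokenizer.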
Re: StandardTokenizer is slowing down highlighting a lot
Might be nice to add a line of documentation to the highlighter about the possible performance hit if one uses StandardAnalyzer, which is probably a common case. Thanks for the speedy response.

-M

On 7/18/07, Mark Miller <[EMAIL PROTECTED]> wrote:
> Unfortunately, StandardAnalyzer is slow. StandardAnalyzer is really
> limited by JavaCC speed. You cannot shave much more performance out of
> the grammar, as it is already about as simple as it gets. You should
> first see if you can get away without it and use a different Analyzer,
> or if you can re-implement just the functionality you need in a custom
> Analyzer.
Increase the performance of Indexing in Lucene
Hi, please help me. It's been a month since I started trying Lucene. My requirements are huge: I have to index and search TBs of data. I have questions on three topics.

1. Problem in indexing

As I need to index TBs of data, by googling and visiting different forums I arrived at the following scheme for indexing:

1. First I created an array of RAMDirectory, then I added the documents to it. After crossing a certain threshold I dumped it onto my drive as tempIndex1.
2. I repeated the same process until all documents were indexed on my drive as tempIndex1, tempIndex2, ...
3. Then finally I loaded the temp directories and merged them into one main, fully indexed directory.
4. I used threading too for this purpose.
5. This somewhat removed the optimize() overhead of IndexWriter, as I added the directories together only at the end.

Am I doing this the right way or not? Is there any other solution to boost the indexing process?

2. Problem in searching

As Lucene doesn't support LSI and SVD for conceptual search, I first search the Lucene index for the user's input text, then retrieve the documents, then expand the query using LSI and SVD, and re-search the index. With a few words in the query there doesn't seem to be a performance problem, but when I expand the query, i.e. when it contains ten words ORed together, it takes an unacceptably long time to get Hits. Is this expected, or am I missing something here too? What are the ways to boost query performance when the query contains many terms, especially ORed ones? (ANDed queries require less time to produce Hits.) I have used a single IndexSearcher and my index is optimized as well.

3. Another problem

I also need to dump my database table into Lucene along with the full-text info. What effect will that have on indexing and searching? Also, I might need to change the name of a Field of a Document indexed in Lucene. Will that be possible? I know it's not possible to change the value of a field, but will it be possible to change the name of the field, or do we have to handle this externally?

Please shed some light on these things. Your help is highly anticipated.

--
View this message in context: http://www.nabble.com/Inrease-the-performance-of-Indexing-in-Lucene-tf4108165.html#a11682360
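The buffered-batch scheme in point 1 can be sketched abstractly in plain Java, with lists standing in for the RAMDirectory buffer, the tempIndex directories, and the final addIndexes(...) merge. This mirrors the pattern only, not the Lucene API:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the buffer-then-merge pattern: collect documents in a fast
// in-memory batch (the RAMDirectory role), set each full batch aside
// (the tempIndex1, tempIndex2, ... role), and merge everything once at
// the end instead of optimizing as you go.
class BatchedIndexer {

    private final int batchSize;
    private final List<String> buffer = new ArrayList<String>();
    private final List<List<String>> tempIndexes = new ArrayList<List<String>>();

    BatchedIndexer(int batchSize) { this.batchSize = batchSize; }

    void add(String doc) {
        buffer.add(doc);
        if (buffer.size() >= batchSize) flush();
    }

    private void flush() {
        if (!buffer.isEmpty()) {
            tempIndexes.add(new ArrayList<String>(buffer));  // one "tempIndex"
            buffer.clear();
        }
    }

    // The final addIndexes(...) step: a single merge of all the batches.
    List<String> merge() {
        flush();
        List<String> merged = new ArrayList<String>();
        for (List<String> batch : tempIndexes) merged.addAll(batch);
        return merged;
    }
}
```

The win comes from paying the expensive merge/optimize cost once at the end rather than incrementally; the batch size trades memory for the number of temporary indexes to merge.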