Re: QueryParser and compound words
On Thursday 13 March 2003 00:52, Magnus Johansson wrote:
> Tatu Saloranta wrote:
...
> >But same happens during indexing; fotbollsmatch should be properly
> >split and stemmed to "fotboll" and "match" terms, right?
>
> Yes but the word fotbollsmatch was never indexed in this example. Only
> the word fotboll.
> I want a query for fotbollsmatch to match a document containing the word
> fotboll.

Ok, I think I finally understand what you meant. :-)

So, basically, in your case you would prefer the query:

fotbollsmatch

to expand (after stemming etc.) to:

fotboll match

and not

"fotboll match"

so that matching just one of the words would be enough for a hit (either "either of", "just first word" or "just last word").

It would be possible to implement this by subclassing the default QueryParser and modifying its behaviour slightly. In QueryParser you should be able to override the default handling for terms, so that whenever a single token (in this case "fotbollsmatch") expands to multiple Terms, you do not construct a PhraseQuery but a BooleanQuery of TermQueries (look at getFieldQuery(); it handles basic search terms). You may need simple heuristics for detecting white space that indicates "normal" phrases, which probably should still be handled using PhraseQuery.

Of course this is all assuming you still do want that functionality. :-)
And if you do, it would be a good idea to contribute the patch back in case someone else finds it useful later on (I think many non-English languages have the concept of compound words; German and Finnish at least do).

-+ Tatu +-

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
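The decision described in this message (phrase only when the user typed white space, otherwise "any of the terms") can be sketched without Lucene at all. What follows is a toy model in plain Java, not a QueryParser patch: the class and the methods `analyze` and `fieldQuery` are hypothetical names, the stand-in analyzer only knows the one compound from this thread, and the returned string merely imitates how a query might print. A real implementation would build BooleanQuery and PhraseQuery objects inside an overridden getFieldQuery().

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: hypothetical stand-ins, not Lucene's API.
class CompoundQuerySketch {

    /** Stand-in analyzer: lower-cases words and splits the one compound from this thread. */
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String word : text.trim().toLowerCase().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            if (word.equals("fotbollsmatch")) {
                terms.add("fotboll");
                terms.add("match");
            } else {
                terms.add(word);
            }
        }
        return terms;
    }

    /**
     * Renders the query that a field-query hook could build for one chunk of
     * query text: quoted for a phrase, unquoted for optional single terms.
     */
    static String fieldQuery(String queryText) {
        List<String> terms = analyze(queryText);
        if (terms.isEmpty()) {
            return "";
        }
        if (terms.size() == 1) {
            return terms.get(0);                          // plain term query
        }
        // Heuristic from the message: white space means the user asked for a phrase.
        boolean userTypedPhrase = queryText.trim().matches(".*\\s.*");
        if (userTypedPhrase) {
            return "\"" + String.join(" ", terms) + "\""; // keep phrase semantics
        }
        return String.join(" ", terms);                   // compound: any term may match
    }

    public static void main(String[] args) {
        System.out.println(fieldQuery("fotbollsmatch"));  // compound typed as one word
        System.out.println(fieldQuery("fotboll match"));  // phrase typed with a space
    }
}
```

In this sketch the compound becomes two optional terms, while text that already contained a space keeps its phrase form. Whether white space is a reliable enough signal in a real parser is exactly the open question the message raises.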
Re: QueryParser and compound words
Tatu Saloranta wrote:
> On Wednesday 12 March 2003 01:19, Magnus Johansson wrote:
> > Well, the problem arises when a user enters a query with a compound word
> > and the compound word itself is not indexed, only one of its parts.
>
> Yes, but neither is the compound word itself ever used in the query either,
> assuming the same analyser is used (like it always should)?
>
> > For example the index contains a document with the following word:
> > fotboll (football).
> >
> > Let's say the user searches for fotbollsmatch (football game). The word
> > is split into fotboll and match and the phrase "fotboll match" is
> > searched for.
> > The user finds no matching document.
>
> But same happens during indexing; fotbollsmatch should be properly
> split and stemmed to "fotboll" and "match" terms, right?

Yes but the word fotbollsmatch was never indexed in this example. Only the word fotboll.
I want a query for fotbollsmatch to match a document containing the word fotboll.

> > Comparing this to English the user would have found a document, however
> > scored slightly lower than a document containing both the words
> > football and game.
> >
> > I agree with you that this might not be a problem. The user could be
> > instructed to reformulate his query. However the behaviour for an
> > English index and
>
> I actually think that if the user has to be aware of internal stemming and
> reformulate the query, this would be a bit of a problem. :-)
> But I'm not 100% sure the search string would differ from the indexed string,
> assuming the same base token (unprocessed token, i.e. "fotbollsmatch") was
> both contained in the document and searched for using QueryParser.
>
> > a Swedish index would be different.
>
> I think that in general behaviour is heavily dependent on the analyser
> (tokenizer + stemmer) being used, so it's probably different between
> most languages.

I think I'll accept how it works now. It is perhaps unlikely that the user would query the index using a compound word and expect documents containing only one of its parts in the result.

The more I think about it, the more difficult it becomes to come up with a realistic example of why the behaviour would need to be changed.

Thank you for your feedback

/magnus
Re: QueryParser and compound words
On Wednesday 12 March 2003 01:19, Magnus Johansson wrote:
> Well, the problem arises when a user enters a query with a compound word
> and the compound word itself is not indexed, only one of its parts.

Yes, but neither is the compound word itself ever used in the query either, assuming the same analyser is used (like it always should)?

> For example the index contains a document with the following word:
> fotboll (football).
>
> Let's say the user searches for fotbollsmatch (football game). The word
> is split into fotboll and match and the phrase "fotboll match" is
> searched for.
> The user finds no matching document.

But same happens during indexing; fotbollsmatch should be properly split and stemmed to "fotboll" and "match" terms, right?

> Comparing this to English the user would have found a document, however
> scored
> slightly lower than a document containing both the words football and game.
>
> I agree with you that this might not be a problem. The user could be
> instructed
> to reformulate his query. However the behaviour for an English index and

I actually think that if the user has to be aware of internal stemming and reformulate the query, this would be a bit of a problem. :-)
But I'm not 100% sure the search string would differ from the indexed string, assuming the same base token (unprocessed token, i.e. "fotbollsmatch") was both contained in the document and searched for using QueryParser.

> a Swedish
> index would be different.

I think that in general behaviour is heavily dependent on the analyser (tokenizer + stemmer) being used, so it's probably different between most languages.

-+ Tatu +-
Re: QueryParser and compound words
Well, the problem arises when a user enters a query with a compound word and the compound word itself is not indexed, only one of its parts.

For example the index contains a document with the following word:
fotboll (football).

Let's say the user searches for fotbollsmatch (football game). The word is split into fotboll and match and the phrase "fotboll match" is searched for.
The user finds no matching document.

Comparing this to English the user would have found a document, however scored slightly lower than a document containing both the words football and game.

I agree with you that this might not be a problem. The user could be instructed to reformulate his query. However the behaviour for an English index and a Swedish index would be different.

/magnus

Tatu Saloranta wrote:
> On Tuesday 11 March 2003 03:05, Magnus Johansson wrote:
> > Hello
> >
> > I have written an Analyzer for Swedish. Compound words are common in
> > Swedish, therefore my Analyzer tries to split the compound words
> > into its parts. For example the Swedish word fotbollsmatch (football
> > game) is split into fotboll and match.
>
> (same applies to many other languages so this is a common problem I think).
>
> However... I'm not sure why you consider this a problem? The reason quotes
> are added is that since a single token (as parsed by QueryParser) expands
> to multiple terms, it becomes a PhraseQuery. The same happens (should
> happen) during indexing, so the end result should match the word both in
> the "normal" case (word is correctly spelled as a compound word) and when
> the word is (incorrectly) spelled with spaces?
>
> As to quotes; they are only shown when converting the query to a String;
> internally there are no quotes to be matched.
>
> -+ Tatu +-
Re: QueryParser and compound words
On Tuesday 11 March 2003 03:05, Magnus Johansson wrote:
> Hello
>
> I have written an Analyzer for Swedish. Compound words are common in
> Swedish, therefore my Analyzer tries to split the compound words
> into its parts. For example the Swedish word fotbollsmatch (football
> game) is split into fotboll and match.

(same applies to many other languages so this is a common problem I think).

However... I'm not sure why you consider this a problem? The reason quotes are added is that since a single token (as parsed by QueryParser) expands to multiple terms, it becomes a PhraseQuery. The same happens (should happen) during indexing, so the end result should match the word both in the "normal" case (word is correctly spelled as a compound word) and when the word is (incorrectly) spelled with spaces?

As to quotes; they are only shown when converting the query to a String; internally there are no quotes to be matched.

-+ Tatu +-
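The symmetry argument in this message can be checked on a toy model. The sketch below is plain Java with hypothetical helper names (`analyze`, `phraseMatches`, `anyTermMatches`); it is not Lucene code, and the stand-in analyzer only splits the single compound used in this thread. It treats a phrase match as "the query terms occur consecutively and in order in the analyzed document", which is a simplification of what a PhraseQuery does.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy model only: hypothetical helper names, not Lucene's API.
class PhraseSymmetrySketch {

    /** Stand-in analyzer, applied to documents and queries alike. */
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String word : text.trim().toLowerCase().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            if (word.equals("fotbollsmatch")) {
                terms.add("fotboll");
                terms.add("match");
            } else {
                terms.add(word);
            }
        }
        return terms;
    }

    /** True if the phrase terms occur consecutively, in order, in the document terms. */
    static boolean phraseMatches(List<String> doc, List<String> phrase) {
        return !phrase.isEmpty() && Collections.indexOfSubList(doc, phrase) >= 0;
    }

    /** True if at least one query term occurs anywhere in the document terms. */
    static boolean anyTermMatches(List<String> doc, List<String> query) {
        for (String term : query) {
            if (doc.contains(term)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> query = analyze("fotbollsmatch");

        // Compound spelled correctly, or with a space: both index to the same terms.
        System.out.println(phraseMatches(analyze("fotbollsmatch i dag"), query));
        System.out.println(phraseMatches(analyze("fotboll match i dag"), query));

        // Document containing only one part of the compound.
        List<String> partOnly = analyze("fotboll");
        System.out.println(phraseMatches(partOnly, query));
        System.out.println(anyTermMatches(partOnly, query));
    }
}
```

Under these assumptions the phrase form finds the compound however it was spelled in the document, but not a document that contains only "fotboll"; only the any-term form finds that one, which is the case the rest of the thread turns on.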
QueryParser and compound words
Hello

I have written an Analyzer for Swedish. Compound words are common in Swedish, therefore my Analyzer tries to split the compound words into its parts. For example the Swedish word fotbollsmatch (football game) is split into fotboll and match.

However when I use my Analyzer with the QueryParser the query

fotbollsmatch

is changed into

"fotboll match"

(notice the quotes) when what I really want is the query

fotboll match

(with no quotes).

Is this possible? The splitting of compound words is of no real use if I can't get rid of the quotes.

I have attached some sample code that illustrates the problem (using a dummy Analyzer that splits words longer than five characters into two).

/magnus

--
import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

import java.io.Reader;
import java.io.IOException;

public class TestAnalyzer extends Analyzer {

    public TokenStream tokenStream(String s, Reader reader) {
        return new SplitStream(new StandardTokenizer(reader));
    }

    public static void main(String[] args) throws Exception {
        QueryParser qp = new QueryParser("fieldname", new TestAnalyzer());
        Query q = qp.parse("queryparser");
        System.out.println("Query: " + q.toString("fieldname"));
        System.out.println("Correct: query parser");
    }
}

// Dummy filter: splits every token longer than SPLIT_SIZE characters in two.
class SplitStream extends TokenStream {
    private static final int SPLIT_SIZE = 5;
    private TokenStream tstream;
    private String buffer = null;   // second half of a split token, if pending
    private int start, end;         // offsets of the pending second half

    public SplitStream(TokenStream tstream) {
        this.tstream = tstream;
    }

    public Token next() throws IOException {
        if (buffer == null) {
            Token tok = tstream.next();
            if (tok == null) {
                return null;
            } else if (tok.termText().length() > SPLIT_SIZE) {
                // Remember the tail and return the first SPLIT_SIZE characters now.
                buffer = tok.termText().substring(SPLIT_SIZE);
                start = tok.startOffset() + SPLIT_SIZE;
                end = tok.endOffset();
                return new Token(
                    tok.termText().substring(0, SPLIT_SIZE),
                    tok.startOffset(),
                    tok.startOffset() + SPLIT_SIZE);
            } else {
                return tok;
            }
        } else {
            // Emit the remembered tail as its own token.
            Token t = new Token(buffer, start, end);
            buffer = null;
            return t;
        }
    }
}