Re: Did you mean for multiple terms
On Thursday 04 March 2004 17:55, [EMAIL PROTECTED] wrote:
> Consider the query +michael +jackson, which returns no hits because there is no "michael" in the index, although there is "jackson" (e.g. janet...). Is there any reasonable way to determine whether one or several terms of a query, and which ones, cause the query to fail?

To illustrate: google for "george buhs" and it will suggest "george bush".

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
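One way to approach the question above is to test each required term on its own before suggesting a correction. The following is only a minimal sketch in plain Java, not Lucene code: the class `MissingTerms`, its method, and the `indexedTerms` set are hypothetical stand-ins for the index's term dictionary. Against a real index, the membership test could presumably be replaced by a check such as `reader.docFreq(new Term(field, text)) > 0`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper (not part of Lucene): reports which required query
// terms do not occur in a set of indexed terms.
class MissingTerms {

    // indexedTerms is an assumption: it stands in for the index's term dictionary.
    static List<String> missing(List<String> requiredTerms, Set<String> indexedTerms) {
        List<String> result = new ArrayList<>();
        for (String term : requiredTerms) {
            if (!indexedTerms.contains(term)) {
                result.add(term); // this term alone is enough to make an AND query fail
            }
        }
        return result;
    }
}
```

For the terms "michael" and "jackson" checked against an index containing only "janet" and "jackson", the method returns a list holding just "michael". Note that an empty result does not guarantee hits: every term may exist while no single document contains all of them.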
Re: Did you mean...
On Monday 16 February 2004 20:56, Erik Hatcher wrote:
> On Feb 16, 2004, at 9:50 AM, [EMAIL PROTECTED] wrote:
>>     TokenStream in = new WhitespaceAnalyzer().tokenStream("contents",
>>         new StringReader(doc.getField("contents").stringValue()));
>
> The field is the field name. No built-in analyzers use it, but custom analyzers could key off of it to do field-specific analysis. Look at

If I want to tokenize all Fields, would I have to get a tokenStream for each Field separately and process them separately? Or can I get one master stream that combines all Fields?
Re: Did you mean...
On Feb 17, 2004, at 6:53 AM, [EMAIL PROTECTED] wrote:
> On Monday 16 February 2004 20:56, Erik Hatcher wrote:
>> On Feb 16, 2004, at 9:50 AM, [EMAIL PROTECTED] wrote:
>>>     TokenStream in = new WhitespaceAnalyzer().tokenStream("contents",
>>>         new StringReader(doc.getField("contents").stringValue()));
>>
>> The field is the field name. No built-in analyzers use it, but custom analyzers could key off of it to do field-specific analysis. Look at
>
> If I want to tokenize all Fields, would I have to get a tokenStream for each Field separately and process them separately? Or can I get one master stream that combines all Fields?

You would do them separately. I'm not clear on what you are trying to do. The Analyzer does all this during indexing automatically for you, but it sounds like you are just trying to emulate what an Analyzer already does to extract words from text?

Erik
Re: Did you mean...
On Tuesday 17 February 2004 15:18, Erik Hatcher wrote:
> You would do them separately. I'm not clear on what you are trying to do. The Analyzer does all this during indexing automatically for you, but it sounds like you are just trying to emulate what an Analyzer already does to extract words from text?

I am still doing this:

    TokenStream in = analyzer.tokenStream("contents",
        new StringReader(reader.document(i).getField("contents").stringValue()));

And I want to extract all words from all Fields.

Timo
Re: Did you mean...
On Feb 17, 2004, at 9:58 AM, [EMAIL PROTECTED] wrote:
> On Tuesday 17 February 2004 15:18, Erik Hatcher wrote:
>> You would do them separately. I'm not clear on what you are trying to do. The Analyzer does all this during indexing automatically for you, but it sounds like you are just trying to emulate what an Analyzer already does to extract words from text?
>
> I am still doing this:
>
>     TokenStream in = analyzer.tokenStream("contents",
>         new StringReader(reader.document(i).getField("contents").stringValue()));
>
> And I want to extract all words from all Fields.

The words (or terms) are already in the index, ready to be read very rapidly and accurately. IndexReader is what you want to investigate if your fields are indexed. Look into IndexReader and pull the terms directly rather than re-analyzing the text. Provided contents was an indexed field, you could do something like this (taken from a mini-project I'm tinkering with right now):

    public String[] wordsThatStartWith(char c) throws IOException {
        String letter = ("" + c).toLowerCase();
        ArrayList words = new ArrayList();

        if (reader == null) {
            reader = IndexReader.open(indexPath);
        }

        // Position the enumeration at the first term >= letter in the "word" field.
        TermEnum terms = reader.terms(new Term("word", letter));
        // Stop at the end of the enumeration or when leaving the "word" field.
        while (terms.term() != null && "word".equals(terms.term().field())) {
            String word = terms.term().text();
            if (word.startsWith(letter)) {
                words.add(word);
            } else {
                break;
            }
            if (!terms.next()) {
                break;
            }
        }

        Collections.sort(words);
        String[] sortedWords = (String[]) words.toArray(new String[0]);
        return sortedWords;
    }

You'll need to adapt this code to your environment and field(s), as what is here is designed to pull all the words that start with a specified letter.

Erik
Re: Did you mean...
On Tuesday 17 February 2004 16:13, Erik Hatcher wrote:
> The words (or terms) are already in the index ready to be read very rapidly and accurately. IndexReader is what you want to investigate if your fields are indexed. Look into IndexReader and pull the terms directly rather than re-analyzing the text. Provided contents was an indexed field, you

Well, but my index was created using a GermanAnalyzer. I have to re-analyze it with WhitespaceAnalyzer if I don't want the words to be truncated... What you do is what I did at the beginning of the thread :-)

Timo
Re: Did you mean...
On Feb 17, 2004, at 11:39 AM, [EMAIL PROTECTED] wrote:
> On Tuesday 17 February 2004 16:13, Erik Hatcher wrote:
>> The words (or terms) are already in the index ready to be read very rapidly and accurately. IndexReader is what you want to investigate if your fields are indexed. Look into IndexReader and pull the terms directly rather than re-analyzing the text. Provided contents was an indexed field, you
>
> Well, but my index was created using a GermanAnalyzer. I have to re-analyze it with WhitespaceAnalyzer if I don't want the words to be truncated... What you do is what I did at the beginning of the thread :-)

*arg* I feel like we are going in circles here. Why use the GermanAnalyzer at all if it is not what you want? Re-index!

Erik
Re: Did you mean...
On Tuesday 17 February 2004 18:05, Erik Hatcher wrote:
> *arg* I feel like we are going in circles here.

Me, too :-)

> Why use the GermanAnalyzer at all if it is not what you want? Re-index!

I want to use the GermanAnalyzer. But not for the "did you mean" functionality... That's what this thread is all about :)

Timo
Re: Did you mean...
On Thursday 12 February 2004 18:35, Viparthi, Kiran (AFIS) wrote:
> As mentioned, the only way I can see is to get the output of the analyzer directly as a TokenStream, iterate through it, and insert it into a Map.

Could you provide or point me to some example code on how to get and use TokenStream? The API docs are somewhat unclear to me...
RE: Did you mean...
Hi Timo,

I was referring to your previous code: you can collect all the text from the terms.

    IndexReader reader = IndexReader.open(ram);
    TermEnum te = reader.terms();
    StringBuffer sb = new StringBuffer();
    while (te.next()) {
        Term t = te.term();
        // separate the terms with a space so StringTokenizer can split them again
        sb.append(t.text()).append(' ');
    }

Then you can get the tokens using StringTokenizer on sb.toString() and put them into a Map, counting the occurrences.

As mentioned, I didn't use any information from the index, so I didn't use any TokenStream, but let me check it out.

Kiran

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: 16 February 2004 11:38
To: Lucene Users List
Subject: Re: Did you mean...

On Thursday 12 February 2004 18:35, Viparthi, Kiran (AFIS) wrote:
> As mentioned, the only way I can see is to get the output of the analyzer directly as a TokenStream, iterate through it, and insert it into a Map.

Could you provide or point me to some example code on how to get and use TokenStream? The API docs are somewhat unclear to me...
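The counting step described above (tokens from a StringTokenizer collected into a Map of occurrences) could look roughly like the following. This is a plain-Java sketch under assumptions: the class name `WordCounts` and its method are hypothetical, and the input is taken to be a whitespace-separated string such as the result of sb.toString().

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical helper: counts how often each whitespace-separated token occurs.
class WordCounts {

    static Map<String, Integer> count(String text) {
        // TreeMap keeps the words in sorted order, which is convenient for a word list.
        Map<String, Integer> counts = new TreeMap<>();
        StringTokenizer st = new StringTokenizer(text);
        while (st.hasMoreTokens()) {
            // Insert 1 for a new word, otherwise add 1 to the existing count.
            counts.merge(st.nextToken(), 1, Integer::sum);
        }
        return counts;
    }
}
```

A call such as `WordCounts.count("bush jackson bush")` maps "bush" to 2 and "jackson" to 1. Keep in mind that a term enumeration lists each distinct term only once, so counts gathered this way are only meaningful when the text comes from the stored field values rather than from the term dictionary.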
Re: Did you mean...
On Monday 16 February 2004 12:02, Viparthi, Kiran (AFIS) wrote:
> As mentioned, I didn't use any information from the index, so I didn't use any TokenStream, but let me check it out.

deprecated:

    String description = doc.getField("contents").stringValue();
    final java.io.Reader r = new StringReader(description);
    final TokenStream in = analyzer.tokenStream(r);
    for (Token token; (token = in.next()) != null; ) {
        System.out.println(token.termText());
    }

But the result is the same, the words are actually truncated (instead of has, had, have, etc. only ha) :-(
Re: Did you mean...
On Feb 16, 2004, at 6:12 AM, [EMAIL PROTECTED] wrote:
> On Monday 16 February 2004 12:02, Viparthi, Kiran (AFIS) wrote:
>> As mentioned, I didn't use any information from the index, so I didn't use any TokenStream, but let me check it out.
>
> deprecated:
>
>     String description = doc.getField("contents").stringValue();

What is the value of description here?

>     final java.io.Reader r = new StringReader(description);
>     final TokenStream in = analyzer.tokenStream(r);

And what analyzer are you using here?

>     for (Token token; (token = in.next()) != null; ) {
>         System.out.println(token.termText());
>     }
>
> But the result is the same, the words are actually truncated (instead of has, had, have, etc. only ha) :-(
Re: Did you mean...
On Monday 16 February 2004 12:40, Erik Hatcher wrote:
> On Feb 16, 2004, at 6:12 AM, [EMAIL PROTECTED] wrote:
>>     String description = doc.getField("contents").stringValue();
>
> What is the value of description here?

? The value of the field contents :-) Long, plain text..

>>     final java.io.Reader r = new StringReader(description);
>>     final TokenStream in = analyzer.tokenStream(r);
>
> And what analyzer are you using here?

GermanAnalyzer (yes, has, had, etc. below is fictional, but most people here probably don't speak German... e.g. automobile may become automob or something like this).

>> But the result is the same, the words are actually truncated (instead of has, had, have, etc. only ha) :-(
Re: Did you mean...
On Feb 16, 2004, at 7:59 AM, [EMAIL PROTECTED] wrote:
> On Monday 16 February 2004 12:40, Erik Hatcher wrote:
>> On Feb 16, 2004, at 6:12 AM, [EMAIL PROTECTED] wrote:
>>>     String description = doc.getField("contents").stringValue();
>>
>> What is the value of description here?
>
> ? The value of the field contents :-) Long, plain text..

I'm asking for specifics because you listed a specific truncation problem.

>>>     final java.io.Reader r = new StringReader(description);
>>>     final TokenStream in = analyzer.tokenStream(r);
>>
>> And what analyzer are you using here?
>
> GermanAnalyzer (yes, has, had, etc. below is fictional, but most people here probably don't speak German... e.g. automobile may become automob or something like this).

And thus the nature of the problem. Try using the WhitespaceAnalyzer instead to see what you get.

Erik

>>> But the result is the same, the words are actually truncated (instead of has, had, have, etc. only ha) :-(
Re: Did you mean...
On Monday 16 February 2004 15:16, Erik Hatcher wrote:
> And thus the nature of the problem. Try using the WhitespaceAnalyzer instead to see what you get.

Much better! :-) But sometimes it still returns multiple words as a single term... :-\

And it does not care about punctuation, but that's probably something I'll have to handle on my own...
Re: Did you mean...
On Monday 16 February 2004 15:27, [EMAIL PROTECTED] wrote:
> But sometimes it still returns multiple words as a single term... :-\

Sorry, silly mistake of mine.
Re: Did you mean...
On Monday 16 February 2004 12:12, [EMAIL PROTECTED] wrote:
> deprecated:
>
>     String description = doc.getField("contents").stringValue();
>     final java.io.Reader r = new StringReader(description);
>     final TokenStream in = analyzer.tokenStream(r);
>     for (Token token; (token = in.next()) != null; ) {
>         System.out.println(token.termText());
>     }

Can somebody explain tokenStream() to me? This is not deprecated:

    TokenStream in = new WhitespaceAnalyzer().tokenStream("contents",
        new StringReader(doc.getField("contents").stringValue()));

But what is the first argument (field) for tokenStream() good for? Actually I can type whatever I want...? I don't understand the short description in the API docs...

Timo
Re: Did you mean...
On Monday 16 February 2004 15:16, Erik Hatcher wrote:
> And thus the nature of the problem. Try using the WhitespaceAnalyzer instead to see what you get.

Can I chain multiple analyzers in order to filter common stop words?

Timo
Re: Did you mean...
On Feb 16, 2004, at 9:50 AM, [EMAIL PROTECTED] wrote:
> Can somebody explain tokenStream() to me?

You are now venturing under the covers of Lucene's API. This is where I give the sage advice to get the Lucene source code and surf around it a bit. (It helps to have a nice IDE where you can click around classes and see the object hierarchy easily.) TokenStream is used by the Analyzer to split text into terms.

>     TokenStream in = new WhitespaceAnalyzer().tokenStream("contents",
>         new StringReader(doc.getField("contents").stringValue()));
>
> But what is the first argument (field) for tokenStream() good for? Actually I can type whatever I want...? I don't understand the short description in the API docs...

The field is the field name. No built-in analyzers use it, but custom analyzers could key off of it to do field-specific analysis. Look at the PerFieldAnalyzerWrapper to make per-field analysis easier than writing a custom one that keys off the field name.

Erik
Re: Did you mean...
On Feb 16, 2004, at 10:34 AM, [EMAIL PROTECTED] wrote:
> On Monday 16 February 2004 15:16, Erik Hatcher wrote:
>> And thus the nature of the problem. Try using the WhitespaceAnalyzer instead to see what you get.
>
> Can I chain multiple analyzers in order to filter common stop words?

You cannot chain Analyzers per se, but you can easily write a custom analyzer that chains the various operations such as stop word removal, lower-casing, stemming, filtering, etc. Have a peek at the source code of your favorite analyzers to get an idea of how these are built, and you will see how simply they are put together.

Erik
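To illustrate the chaining idea without depending on Lucene's classes, here is a minimal plain-Java sketch in which each analysis step takes a token list and returns a new one. Everything in it (the class `ChainedAnalysis`, its methods, and the tiny stop word list) is hypothetical and for illustration only; in Lucene itself the corresponding pieces would be a tokenizer wrapped by token filters such as LowerCaseFilter and StopFilter inside a custom Analyzer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration of chaining analysis steps; not Lucene's API.
class ChainedAnalysis {

    // A deliberately tiny, made-up stop word list.
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("the", "a", "and", "of"));

    // Step 1: split on whitespace (roughly what a whitespace tokenizer does).
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Step 2: lower-case every token.
    static List<String> lowerCase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(t.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // Step 3: drop stop words.
    static List<String> removeStopWords(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (!STOP_WORDS.contains(t)) {
                out.add(t);
            }
        }
        return out;
    }

    // The "analyzer": the steps chained in a fixed order.
    static List<String> analyze(String text) {
        return removeStopWords(lowerCase(tokenize(text)));
    }
}
```

With these definitions, `ChainedAnalysis.analyze("The King of Pop")` yields the tokens "king" and "pop". The order matters: lower-casing has to run before stop word removal, otherwise "The" would slip past the lower-case stop list. No stemming step is included, which is what a "did you mean" word list would want.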
RE: Did you mean...
Hi,

We achieved this by creating a separate index of words, extracting the complete list of words. You can also work on the frequency if you are extracting these from other indexes, but that could be expensive.

Manipulating the search to do a fuzzy search in the words index would give you a better list of matching words for spellings.

Kiran.

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: 12 February 2004 08:48
To: Lucene Users List
Subject: Re: Did you mean...

On Thursday 12 February 2004 00:15, Matt Tucker wrote:
> We implemented that type of system using a spelling engine by Wintertree: http://www.wintertree-software.com There are some free Java spelling packages out there too that you could likely use.

But this does not ensure that the word really exists in the index. The word Google proposes, however, does exist.

Regards
Timo
Re: Did you mean...
Hi Timo!

There is no built-in way in Lucene to achieve this. I have done a simple implementation with a patched FuzzyQuery for each term. A new method (bestOrderRewrite) returns an ordered list of all fuzzy terms that actually exist in the index. There is no guarantee that the suggested term is spelled correctly, though... Basically this works best when the search is done on one term only.

Search code is something like this:

    // on no_hits
    cquery = cquery.rewrite(reader);
    if (cquery instanceof TermQuery) {
        // if search contained only one term this is a TermQuery instance
        FuzzyQuery fquery = new FuzzyQuery(new Term("contents", cquery.toString("contents")));
        TermQuery[] terms = fquery.bestOrderRewrite(reader);
        if (terms.length > 0) {
            StringBuffer alts = new StringBuffer();
            alts.append("Did you mean ").append(terms[0].getTerm().text());
        }
    } else if (cquery instanceof BooleanQuery) {
        // split queries
        // snip
        BooleanClause[] clauses = ((BooleanQuery) cquery).getClauses();
        // get suggestion for each term
        if (clauses[i].required) {
            FuzzyQuery fquery = new FuzzyQuery(new Term("contents", clauses[i].query.toString("contents")));
            TermQuery[] terms = fquery.bestOrderRewrite(reader);
            // ...
        }
        // /snip
    }
    // and so on...

Regards,
Ronnie

> On Thursday 12 February 2004 00:15, Matt Tucker wrote:
>> We implemented that type of system using a spelling engine by Wintertree: http://www.wintertree-software.com There are some free Java spelling packages out there too that you could likely use.
>
> But this does not ensure that the word really exists in the index. The word Google proposes, however, does exist.
>
> Regards
> Timo
Re: Did you mean...
Hi Ronnie!

On Thursday 12 February 2004 09:50, [EMAIL PROTECTED] wrote:
> There is no built-in way in Lucene to achieve this. I have done a simple implementation with a patched FuzzyQuery for each term. A new method (bestOrderRewrite) returns an ordered list of all fuzzy terms that actually exist in the index. There is no guarantee that the suggested term is spelled

Could you please post your FuzzyQuery (did you patch the class or extend it?) or send it via email?

Thanks a lot
Timo
Re: Did you mean...
How about creating a spellcheck dictionary with all the words in the Lucene index? That way you ensure that the word really exists in the index.

----- Original Message -----
From: [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Wednesday, February 11, 2004 11:48 PM
Subject: Re: Did you mean...

On Thursday 12 February 2004 00:15, Matt Tucker wrote:
> We implemented that type of system using a spelling engine by Wintertree: http://www.wintertree-software.com There are some free Java spelling packages out there too that you could likely use.

But this does not ensure that the word really exists in the index. The word Google proposes, however, does exist.

Regards
Timo
Re: Did you mean...
On Feb 12, 2004, at 16:42, Abhay Saswade wrote:
> How about creating a spellcheck dictionary with all the words in the Lucene index? That way you ensure that the word really exists in the index.

You can indeed use the terms identified by Lucene as the dictionary words and apply traditional spell checking tricks like phonetic encodings, Levenshtein distance and so on. This approach works reasonably well in practice.

Cheers,
PA.
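As a rough illustration of the Levenshtein part of this suggestion, the following plain-Java sketch picks the dictionary word closest to a misspelled query term. It is an assumption-laden example, not code from this thread: the class `DidYouMean`, its methods, and the `maxDistance` cut-off are hypothetical, and the dictionary is assumed to hold the unstemmed words taken from the index.

```java
import java.util.Collection;

// Hypothetical helper: suggests the dictionary word with the smallest edit distance.
class DidYouMean {

    // Classic Levenshtein distance (insert, delete, substitute), using two rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the closest word within maxDistance edits, or null if there is none.
    static String suggest(String word, Collection<String> dictionary, int maxDistance) {
        String best = null;
        int bestDistance = maxDistance + 1;
        for (String candidate : dictionary) {
            int d = levenshtein(word, candidate);
            if (d < bestDistance) {
                best = candidate;
                bestDistance = d;
            }
        }
        return best;
    }
}
```

For example, with a dictionary of "jackson", "bush" and "janet", `DidYouMean.suggest("buhs", dictionary, 2)` returns "bush", since swapping two adjacent letters counts as two substitutions here. Scanning the whole dictionary is linear in its size, so for a large index some pre-filtering (by length or first letter, say) would likely be needed.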
Re: Did you mean...
On Thursday 12 February 2004 09:43, Viparthi, Kiran (AFIS) wrote:
> We achieved this by creating a separate index of words, extracting the complete list of words.

How were you extracting the words?

Timo
RE: Did you mean...
Hi Timo,

As we just deal with a small and limited KAON ontology, I should say we use a crude way using StringTokenizer searching for And maintaining a unique list.

But I assume that there could be other, better ways if you are getting them from another index.

Kiran.

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: 12 February 2004 17:54
To: Lucene Users List
Subject: Re: Did you mean...

On Thursday 12 February 2004 09:43, Viparthi, Kiran (AFIS) wrote:
> We achieved this by creating a separate index of words, extracting the complete list of words.

How were you extracting the words?

Timo
Re: Did you mean...
On Thursday 12 February 2004 18:03, [EMAIL PROTECTED] wrote:
> On Thursday 12 February 2004 17:53, [EMAIL PROTECTED] wrote:
>> How were you extracting the words?
>
> Oops, sorry for this stupid question :) Got it.

Hm, seems the question wasn't so stupid after all:

    IndexReader reader = IndexReader.open(ram);
    TermEnum te = reader.terms();
    while (te.next()) { ...

But this obviously includes parts of words, too :-\

Timo
Re: Did you mean...
Timo,

We implemented that type of system using a spelling engine by Wintertree: http://www.wintertree-software.com

There are some free Java spelling packages out there too that you could likely use.

Regards,
Matt

[EMAIL PROTECTED] wrote:
> Hi!
>
> Can I do things like Google's "Did you mean...?" correction for mistyped words with Lucene?
>
> Warm Regards,
> Timo
Re: Did you mean...
On Thursday 12 February 2004 00:15, Matt Tucker wrote:
> We implemented that type of system using a spelling engine by Wintertree: http://www.wintertree-software.com There are some free Java spelling packages out there too that you could likely use.

But this does not ensure that the word really exists in the index. The word Google proposes, however, does exist.

Regards
Timo