Re: Question on how to build a query
Well, I seem to have gotten something to work. Maybe someone could just comment on my approach.

I wrote my indexer so that it added each field without tokenizing it:

    Field fnameField = new Field("fname", fname.toLowerCase(), true, true, false);
    Field lnameField = new Field("lname", lname.toLowerCase(), true, true, false);
    Field cityField = new Field("city", position.toLowerCase(), true, true, false);

By the way, if this is the case, is the indexer even using the analyzer that I pass to it?

Then in my search code I create the first name query as a WildcardQuery if the first name is provided (adding a * to the end if it's not already there):

    Term fnameTerm = null;
    Query fnameQuery = null;
    if (fnameIn.length() > 0) {
        if (!fnameIn.endsWith("*")) {
            fnameIn += "*";
        }
        fnameTerm = new Term("fname", fnameIn);
        fnameQuery = new WildcardQuery(fnameTerm);
    }

I then create my last name query as either a WildcardQuery or a TermQuery depending on whether it contains a *:

    Term lnameTerm = new Term("lname", lnameIn);
    Query lnameQuery = null;
    if (lnameIn.indexOf("*") != -1) {
        lnameQuery = new WildcardQuery(lnameTerm);
    } else {
        lnameQuery = new TermQuery(lnameTerm);
    }

Lastly, I create the city query as a TermQuery.

Finally, I add the 3 queries to a BooleanQuery, not adding the first name query if it is null (this means a first name was not provided) and making last name and city required:

    if (fnameQuery != null) {
        overallQuery.add(fnameQuery, true, false);
    }
    overallQuery.add(lnameQuery, true, false);
    overallQuery.add(positionQuery, true, false);

I then search my index and it appears to work. I haven't tested it extensively yet, though.

Does this seem like a reasonable way to approach this problem, or am I missing something that's going to bite me in the you-know-what?

Thanks.

Jason

Jason St. Louis wrote:

> Hi everyone. I'm wondering if someone could help me out.
> I have created an index of a database of person records where I have created documents with the following fields:
>
> database primary_key (stored, not indexed)
> first name (indexed)
> last name (indexed)
> city (indexed)
>
> I used SimpleAnalyzer when creating the index. I am providing a web-based form to search this index. The form has 3 fields for first name, last name and city (city is a drop-down list). I want to take the user's input from these 3 fields and build a query such that:
>
> A) Last name is mandatory and can be wildcarded (I will probably make sure the value begins with at least one letter)
> B) First name can be wildcarded (same as last name, although if it is left blank, I would probably just search the last_name and city and ignore the first name)
> C) City is mandatory and must match exactly
>
> How would I go about building this query? Do I create a wildcard query for first name and last name, a term query for city and then combine them into a boolean query where all 3 terms must be matched? I kind of feel like I'm grasping at straws here. I think I just need a jumpstart to understand how the Query API works.
>
> Thanks.
>
> Jason

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
amusing interaction between advanced tokenizers and highlighter package
I've run across an amusing interaction between advanced Analyzers/TokenStreams and the very useful term highlighter: http://cvs.apache.org/viewcvs/jakarta-lucene-sandbox/contributions/highlighter/

I have a custom Analyzer I'm using to index javadoc-generated web pages. The Analyzer in turn has a custom TokenStream which tries to tokenize java-language tokens more intelligently.

A naive analyzer would turn something like SyncThreadPool into one token. Mine uses the great Lucene capability of Tokens being able to have a 0 position increment to turn it into the token stream:

Sync (incr = 0)
Thread (incr = 0)
Pool (incr = 0)
SyncThreadPool (incr = 1)

[As an aside, maybe it should also pair up the subtokens, so SyncThread and ThreadPool appear too.]

The point behind this is that someone searching for threadpool probably would want to see a match for SyncThreadPool even though this is the evil leading-prefix case. With most other Analyzers and ways of forming a query this would be missed, which I think is anti-human and annoys me to no end.

So the analyzer/tokenizer works great, and I have a demo site about to come up that indexes lots of publicly available javadoc as a kind of resource so you can easily find what's already been done.

The problem is as follows. In all cases I use my Analyzer to index the documents. If I use my Analyzer with the Highlighter package, it doesn't look at the position increment of the tokens and consequently a nonsense stream of matches is output. If I use a different Analyzer w/ the highlighter (say, the StandardAnalyzer), then it doesn't show the matches that really matched, as it doesn't see the subtokens.

It might be that the fix is for the Highlighter to look at the position increment of tokens and only pass by one if multiple ones have an incr of 0 and match one part of the query.

Has this come up before and is the issue clear?

thx,
Dave
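The token stream described in this message can be sketched in plain Java, without Lucene on the classpath. This is only an illustration under stated assumptions: the class name SubtokenSketch, the Tok holder and the splitting regex are hypothetical and are not taken from the poster's analyzer. The order of tokens and their increments simply follows the listing in the message.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the analyzer discussed in this thread.
final class SubtokenSketch {

    /** Minimal stand-in for a token: text, character offsets, position increment. */
    static final class Tok {
        final String text;
        final int start;
        final int end;
        final int posIncr;

        Tok(String text, int start, int end, int posIncr) {
            this.text = text;
            this.start = start;
            this.end = end;
            this.posIncr = posIncr;
        }

        @Override
        public String toString() {
            return text + " [" + start + "," + end + ") incr=" + posIncr;
        }
    }

    // Assumed splitting rule: an all-caps run, a capitalized or lowercase word, or a digit run.
    private static final Pattern PART =
            Pattern.compile("[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+");

    /** Emits the subtokens (incr 0) followed by the whole identifier (incr 1). */
    static List<Tok> tokens(String identifier, int baseOffset) {
        List<Tok> out = new ArrayList<>();
        Matcher m = PART.matcher(identifier);
        while (m.find()) {
            // A part equal to the whole identifier would only duplicate it.
            if (!m.group().equals(identifier)) {
                out.add(new Tok(m.group(), baseOffset + m.start(), baseOffset + m.end(), 0));
            }
        }
        out.add(new Tok(identifier, baseOffset, baseOffset + identifier.length(), 1));
        return out;
    }

    public static void main(String[] args) {
        for (Tok t : tokens("SyncThreadPool", 0)) {
            System.out.println(t);
        }
    }
}
```

Each subtoken keeps the character offsets of its own slice of the identifier, which is the information a highlighter would need in order to mark up only the part that matched.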
Re: amusing interaction between advanced tokenizers and highlighter package
Yes, this issue has come up before with other choices of analyzers. I think it should be fixable without changing any of the highlighter APIs - can you email me or post here the source to your analyzer?

Cheers
Mark
Re: Question on how to build a query
On Jun 19, 2004, at 1:51 AM, Jason St. Louis wrote:

> I wrote my indexer so that it added each field without tokenizing it:
>
>     Field fnameField = new Field("fname", fname.toLowerCase(), true, true, false);
>     Field lnameField = new Field("lname", lname.toLowerCase(), true, true, false);
>     Field cityField = new Field("city", position.toLowerCase(), true, true, false);
>
> By the way, if this is the case, is the indexer even using the analyzer that I pass to it?

No. Tokenized fields are analyzed. Non-tokenized fields are left as-is. It might be clearer if you used Field.Keyword instead, which is identical to what you have here.

> Then in my search code I create the first name query as a WildcardQuery if the first name is provided (adding a * to the end if it's not already there):
>
>     Term fnameTerm = null;
>     Query fnameQuery = null;
>     if (fnameIn.length() > 0) {
>         if (!fnameIn.endsWith("*")) {
>             fnameIn += "*";
>         }
>         fnameTerm = new Term("fname", fnameIn);
>         fnameQuery = new WildcardQuery(fnameTerm);
>     }

I recommend PrefixQuery in this case. I presume you lowercased fnameIn? You should, to get it to match what was indexed.

> Does this seem like a reasonable way to approach this problem, or am I missing something that's going to bite me in the you-know-what?

Seems reasonable to me as long as you are lowercasing the strings at query time also.

Erik
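The advice in this reply (lowercase at query time, and prefer a prefix query when the only wildcard is a trailing *) can be sketched as a small input classifier. This is an illustration only and uses no Lucene classes: NameClause, its Kind enum and the prefixByDefault flag are hypothetical names. The intended mapping, as an assumption, is PREFIX to PrefixQuery, WILDCARD to WildcardQuery and EXACT to TermQuery.

```java
import java.util.Locale;

// Hypothetical helper, not part of Lucene.
final class NameClause {
    enum Kind { EXACT, PREFIX, WILDCARD }

    final Kind kind;
    final String text; // lowercased; the trailing '*' is removed for PREFIX

    private NameClause(Kind kind, String text) {
        this.kind = kind;
        this.text = text;
    }

    /**
     * Classifies one form field. Returns null for blank input, meaning
     * "leave this clause out of the overall query".
     */
    static NameClause of(String rawInput, boolean prefixByDefault) {
        if (rawInput == null) {
            return null;
        }
        // Lowercase so the term matches what was indexed with toLowerCase().
        String s = rawInput.trim().toLowerCase(Locale.ROOT);
        if (s.isEmpty()) {
            return null;
        }
        boolean hasWildcard = s.indexOf('*') >= 0 || s.indexOf('?') >= 0;
        if (!hasWildcard) {
            return new NameClause(prefixByDefault ? Kind.PREFIX : Kind.EXACT, s);
        }
        String head = s.substring(0, s.length() - 1);
        boolean onlyTrailingStar = s.endsWith("*") && !head.isEmpty()
                && head.indexOf('*') < 0 && head.indexOf('?') < 0;
        return onlyTrailingStar
                ? new NameClause(Kind.PREFIX, head)
                : new NameClause(Kind.WILDCARD, s);
    }

    public static void main(String[] args) {
        NameClause first = NameClause.of("Jo", true);    // first name: always a prefix
        NameClause last = NameClause.of("Smi*", false);  // last name: prefix only if asked for
        System.out.println(first.kind + " " + first.text);
        System.out.println(last.kind + " " + last.text);
    }
}
```

Keeping the classification separate from query construction makes the lowercasing step hard to forget, since every clause passes through the same method.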
Re: amusing interaction between advanced tokenizers and highlighter package
On Jun 19, 2004, at 2:29 AM, David Spencer wrote:

> A naive analyzer would turn something like SyncThreadPool into one token. Mine uses the great Lucene capability of Tokens being able to have a 0 position increment to turn it into the token stream:
>
> Sync (incr = 0)
> Thread (incr = 0)
> Pool (incr = 0)
> SyncThreadPool (incr = 1)
>
> [As an aside, maybe it should also pair up the subtokens, so SyncThread and ThreadPool appear too.]
>
> The point behind this is that someone searching for threadpool probably would want to see a match for SyncThreadPool even though this is the evil leading-prefix case. With most other Analyzers and ways of forming a query this would be missed, which I think is anti-human and annoys me to no end.

There are indexing/querying solutions/workarounds to the leading-prefix issue, such as reversing the text as you index it and ensuring you do the same on queries so they match. There are some interesting techniques for this type of thing in the Managing Gigabytes book I'm currently reading, which Lucene could support with custom analysis and queries, I believe.

> The problem is as follows. In all cases I use my Analyzer to index the documents. If I use my Analyzer with the Highlighter package, it doesn't look at the position increment of the tokens and consequently a nonsense stream of matches is output. If I use a different Analyzer w/ the highlighter (say, the StandardAnalyzer), then it doesn't show the matches that really matched, as it doesn't see the subtokens.

Are your subtokens marked with correct offset values? This probably doesn't relate to the problem you're seeing, but I'm curious.

> It might be that the fix is for the Highlighter to look at the position increment of tokens and only pass by one if multiple ones have an incr of 0 and match one part of the query.
>
> Has this come up before and is the issue clear?

The problem is clear, and I've identified this issue with my exploration of the Highlighter also. The Highlighter works well for the most common scenarios, but certainly doesn't cover all the bases. The majority of scenarios do not use multiple tokens in a single position. It also doesn't currently handle the new SpanQuery family - although highlighting spans would be quite cool.

After learning how Highlighter works, I have a deep appreciation for the great work Mark put into it - it is well done. As for this issue, though, I think your solution sounds reasonable, although I haven't thought it through completely. Perhaps Mark can comment. If you do modify it to work for your case, it would be great to have your contribution rolled back in :)

Erik
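The "reverse the text at index time and at query time" workaround mentioned in this reply can be illustrated with a sorted set standing in for the term dictionary. This is a sketch under assumptions, not Lucene code: ReversedTermIndex and its methods are hypothetical names. The point it shows is that a leading-wildcard search such as *pool turns into an ordinary prefix scan once every term is stored reversed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

// Hypothetical stand-in for a sorted term dictionary; not a Lucene class.
final class ReversedTermIndex {
    private final TreeSet<String> reversedTerms = new TreeSet<>();

    /** Index time: store each term lowercased and reversed. */
    void add(String term) {
        reversedTerms.add(reverse(term.toLowerCase(Locale.ROOT)));
    }

    /** Query time: reverse the suffix too, then scan the terms that start with it. */
    List<String> endingWith(String suffix) {
        String prefix = reverse(suffix.toLowerCase(Locale.ROOT));
        List<String> hits = new ArrayList<>();
        for (String r : reversedTerms.tailSet(prefix)) {
            if (!r.startsWith(prefix)) {
                break; // sorted order: no later term can share the prefix
            }
            hits.add(reverse(r)); // un-reverse for display
        }
        return hits;
    }

    private static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    public static void main(String[] args) {
        ReversedTermIndex index = new ReversedTermIndex();
        index.add("SyncThreadPool");
        index.add("ThreadPool");
        index.add("Thread");
        System.out.println(index.endingWith("pool"));
    }
}
```

The trade-off is that a reversed field only helps suffix searches; a search for a fragment in the middle of a term still needs something like the subtoken approach described in the quoted message.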
stop words in index
Hi!

How come stop words show up in the index (HighFreqTerms)? Yes, I do use the same analyzer for indexing and searching.

    class SearchFacade {
        private final static String[] GERMAN_STOP_WORDS = new String[] { "foo", "bar" };
        private final static Analyzer GERMAN_ANALYZER = new SnowballAnalyzer("German2", GERMAN_STOP_WORDS);

        public void index() {
            writer = new IndexWriter(Configuration.Lucene.INDEX, GERMAN_ANALYZER, true);
            ...
        }

        public void search(String query) {
            final Query q = MultiFieldQueryParser.parse(query, new String[] { "blah", "foo", "bar" }, GERMAN_ANALYZER);
            ...
        }
    }
Re: amusing interaction between advanced tokenizers and highlighter package
[EMAIL PROTECTED] wrote:

> Yes, this issue has come up before with other choices of analyzers. I think it should be fixable without changing any of the highlighter APIs - can you email me or post here the source to your analyzer?

Code attached - don't make fun of it please :) - very prelim. I think it only uses one other file (TRQueue), also attached (but note, it's in a different package). Also, any comments in the code may be inaccurate.

The general goal is as stated in my earlier mail; examples are:

AlphaBeta -> Alpha (incr 0) Beta (incr 0) AlphaBeta (incr 1)
MAX_INT -> MAX (incr 0) INT (incr 0) MAX_INT (incr 1)

thx,
Dave

> Cheers
> Mark

    package com.tropo.lucene;

    import org.apache.lucene.analysis.*;
    import java.io.*;
    import java.util.*;
    import com.tropo.util.*;
    import java.util.regex.*;

    /**
     * Try to parse javadoc better than other analyzers.
     */
    public final class JavadocAnalyzer extends Analyzer {
        // [A-Za-z0-9._]+
        //
        public final TokenStream tokenStream( String fieldName, Reader reader) {
            return new LowerCaseFilter( new JStream( fieldName, reader));
        }

        /**
         * Try to break up a token into subset/subtokens that might be said to occur in the same place.
         */
        public static List breakup( String s) {
            // a -> null
            // alphaBeta -> alpha, Beta
            // XXAlpha -> ?, Alpha
            // BIG_NUM -> BIG, NUM
            List lis = new LinkedList();
            Matcher m;
            m = breakupPattern.matcher( s);
            while (m.find()) {
                String g = m.group();
                if ( ! g.equals( s))
                    lis.add( g);
            }
            // hard ones
            m = breakupPattern2.matcher( s);
            while (m.find()) {
                String g;
                if ( m.groupCount() == 2) // weird XXFoo case
                    g = m.group( 2);
                else
                    g = m.group();
                if ( ! g.equals( s))
                    lis.add( g);
                /*
                o.println( "gc: " + m.groupCount() + "/" + m.group( 0) + "/" + m.group( 1) + "/" + m.group( 2));
                */
                //lis.add( m.group());
            }
            return lis;
        }

        /**
         *
         */
        private static class JStream extends TokenStream {
            private TRQueue q = new TRQueue();
            private Set already = new HashSet();
            private String fieldName;
            private PushbackReader pb;
            private StringBuffer sb = new StringBuffer( 32);
            private int offset;
            // eat white
            // have
            private int state = 0;

            /**
             *
             */
            private JStream( String fieldName, Reader reader) {
                this.fieldName = fieldName;
                pb = new PushbackReader( reader);
            }

            /**
             *
             */
            public Token next() throws IOException {
                if ( q.size() > 0) // pre-calculated
                    return (Token) q.dequeue();
                int c;
                int start = offset;
                sb.setLength( 0);
                offset--;
                boolean done = false;
                String type = "mystery";
                state = 0;
                while ( ! done && ( c = pb.read()) != -1) {
                    char ch = (char) c;
                    offset++;
                    switch( state) {
                    case 0:
                        if ( Character.isJavaIdentifierStart( ch)) {
                            start = offset;
                            sb.append( ch);
                            state = 1;
                            type = "id";
                        } else if ( Character.isDigit( ch))
Re: amusing interaction between advanced tokenizers and highlighter package
Erik Hatcher wrote:

> On Jun 19, 2004, at 2:29 AM, David Spencer wrote:
>
>> A naive analyzer would turn something like SyncThreadPool into one token. Mine uses the great Lucene capability of Tokens being able to have a 0 position increment to turn it into the token stream:
>>
>> Sync (incr = 0)
>> Thread (incr = 0)
>> Pool (incr = 0)
>> SyncThreadPool (incr = 1)
>>
>> [As an aside, maybe it should also pair up the subtokens, so SyncThread and ThreadPool appear too.]
>>
>> The point behind this is that someone searching for threadpool probably would want to see a match for SyncThreadPool even though this is the evil leading-prefix case. With most other Analyzers and ways of forming a query this would be missed, which I think is anti-human and annoys me to no end.
>
> There are indexing/querying solutions/workarounds to the leading-prefix issue, such as reversing the text as you index it and ensuring you do the same on queries so they match. There are some interesting techniques for this type of thing in the Managing Gigabytes book I'm currently reading, which Lucene could support with custom analysis and queries, I believe.

Yeah, great book. I thought my approach fit into Lucene the most naturally for my goals - and no doubt, just having the possibility of different pos increments is a great concept that I haven't seen in other search engines.

I keep meaning to try an idea that appeared on the list some months ago: bumping up the incr between sentences so that it's harder for, say, a 2 word phrase to match w/ 1 word in each sentence (makes sense to a computer, but usually not what a human wants). Another side project...

>> The problem is as follows. In all cases I use my Analyzer to index the documents. If I use my Analyzer with the Highlighter package, it doesn't look at the position increment of the tokens and consequently a nonsense stream of matches is output. If I use a different Analyzer w/ the highlighter (say, the StandardAnalyzer), then it doesn't show the matches that really matched, as it doesn't see the subtokens.
>
> Are your subtokens marked with correct offset values? This probably doesn't relate to the problem you're seeing, but I'm curious.

I think so, but this is the first time I've done this kind of thing. When I hit the special case, several of the subtokens are 1st returned w/ an incr of 0, then the normal token w/ an incr of 1 - which seems to make sense to me at least.

>> It might be that the fix is for the Highlighter to look at the position increment of tokens and only pass by one if multiple ones have an incr of 0 and match one part of the query.
>>
>> Has this come up before and is the issue clear?
>
> The problem is clear, and I've identified this issue with my exploration of the Highlighter also. The Highlighter works well for the most common scenarios, but certainly doesn't cover all the bases. The majority of scenarios do not use multiple tokens in a single position. It also doesn't currently handle the new SpanQuery family - although highlighting spans would be quite cool.
>
> After learning how Highlighter works, I have a deep appreciation for the great work Mark put into it - it is well done. As for this issue, though, I think your solution sounds reasonable, although I haven't thought it through completely. Perhaps Mark can comment. If you do modify it to work for your case, it would be great to have your contribution rolled back in :)
>
> Erik

Oh sure, I'll post any changes but wait for Mark for now.
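The sentence-gap idea mentioned in this message (a larger position increment after each sentence so that a two-word phrase cannot match across the boundary) can be sketched with a toy positional index. Everything here is a hypothetical illustration in plain Java: the class name, the SENTENCE_GAP value of 100 and the naive rule that a word ending in '.', '!' or '?' closes a sentence are all assumptions, and none of it is Lucene API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical toy positional index; not a Lucene class.
final class SentenceGapSketch {
    static final int SENTENCE_GAP = 100; // assumed gap, any value > 1 works here

    /** Maps each lowercased word to the positions at which it occurs. */
    static Map<String, List<Integer>> positions(String text) {
        Map<String, List<Integer>> index = new HashMap<>();
        int pos = -1;
        int incr = 1;
        for (String raw : text.trim().split("\\s+")) {
            String word = raw.replaceAll("[^A-Za-z0-9]", "").toLowerCase(Locale.ROOT);
            if (word.isEmpty()) {
                continue;
            }
            pos += incr;
            index.computeIfAbsent(word, k -> new ArrayList<>()).add(pos);
            // Naive sentence detection: the next word gets a large increment.
            char lastChar = raw.charAt(raw.length() - 1);
            incr = (lastChar == '.' || lastChar == '!' || lastChar == '?') ? SENTENCE_GAP : 1;
        }
        return index;
    }

    /** True if "second" occurs at the position directly after some "first". */
    static boolean phraseMatches(Map<String, List<Integer>> index, String first, String second) {
        List<Integer> a = index.get(first);
        List<Integer> b = index.get(second);
        if (a == null || b == null) {
            return false;
        }
        for (int p : a) {
            if (b.contains(p + 1)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> index = positions("The pool is full. Thread started.");
        System.out.println(phraseMatches(index, "full", "thread"));    // words in different sentences
        System.out.println(phraseMatches(index, "thread", "started")); // words in the same sentence
    }
}
```

With an increment of 1 everywhere, "full thread" would count as adjacent; the gap is what keeps the two sentences apart.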
Re: amusing interaction between advanced tokenizers and highlighter
A question before I dive into coding a fix: can I assume (for all analyzers) that the tokens produced by the tokenStream have the following property:

    currentToken.startOffset() >= lastToken.startOffset()

The analyzers I have tested the highlighter with so far have the property:

    currentToken.startOffset() > lastToken.endOffset()

so tokens aren't overlapping, but I understand this isn't the case for others (any demonstrable examples of such problem analyzers would be appreciated for testing purposes).

If I can assume that tokenstreams always produce a zero or more increment in token.startOffset, I think I can design a solution that still works using a single pass of the token stream. I suspect an additional flushText method will be required on the Formatter interface to allow implementations to use a buffer. This buffer would be required to accumulate overlapping token scores when trying to decide if a section of the original text merited any highlight markup.

Cheers
Mark
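The single-pass idea in this message can be sketched as follows, assuming only the stated property that start offsets never decrease. This is not the Highlighter's code: OverlapSketch, Tok, Region and mergeInOnePass are hypothetical names, and summing scores is just one assumed way of "accumulating" them. Overlapping tokens are folded into one buffered region, and a region is closed as soon as a token starts at or beyond its end.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of single-pass overlap handling; not the Highlighter's code.
final class OverlapSketch {

    /** A token reduced to its character offsets and its score against the query. */
    static final class Tok {
        final int start;
        final int end;
        final float score;

        Tok(int start, int end, float score) {
            this.start = start;
            this.end = end;
            this.score = score;
        }
    }

    /** A stretch of original text covered by one or more overlapping tokens. */
    static final class Region {
        int start;
        int end;
        float score;

        Region(int start, int end, float score) {
            this.start = start;
            this.end = end;
            this.score = score;
        }
    }

    /** Merges overlapping tokens in one pass; rejects streams whose start offsets decrease. */
    static List<Region> mergeInOnePass(List<Tok> tokens) {
        List<Region> regions = new ArrayList<>();
        Region current = null;
        int lastStart = Integer.MIN_VALUE;
        for (Tok t : tokens) {
            if (t.start < lastStart) {
                throw new IllegalArgumentException("start offsets must not decrease");
            }
            lastStart = t.start;
            if (current != null && t.start < current.end) {
                // Overlaps the buffered region: extend it and accumulate the score.
                current.end = Math.max(current.end, t.end);
                current.score += t.score;
            } else {
                // No overlap: the previous region is complete, start a new one.
                current = new Region(t.start, t.end, t.score);
                regions.add(current);
            }
        }
        return regions;
    }

    public static void main(String[] args) {
        List<Tok> stream = new ArrayList<>();
        stream.add(new Tok(0, 14, 0f));  // whole identifier
        stream.add(new Tok(0, 4, 0f));   // first subtoken
        stream.add(new Tok(4, 10, 1f));  // second subtoken, matches the query
        stream.add(new Tok(15, 18, 0f)); // next word
        for (Region r : mergeInOnePass(stream)) {
            System.out.println(r.start + "-" + r.end + " score=" + r.score);
        }
    }
}
```

A region with a non-zero accumulated score would be the candidate for highlight markup; deciding how much of the region to mark is left open here, as it is in the message.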
Re: Question on how to build a query
Erik Hatcher wrote:

> On Jun 19, 2004, at 1:51 AM, Jason St. Louis wrote:
>
>> I wrote my indexer so that it added each field without tokenizing it:
>>
>>     Field fnameField = new Field("fname", fname.toLowerCase(), true, true, false);
>>     Field lnameField = new Field("lname", lname.toLowerCase(), true, true, false);
>>     Field cityField = new Field("city", position.toLowerCase(), true, true, false);
>>
>> By the way, if this is the case, is the indexer even using the analyzer that I pass to it?
>
> No. Tokenized fields are analyzed. Non-tokenized fields are left as-is. It might be clearer if you used Field.Keyword instead, which is identical to what you have here.

That's what I figured. I suppose if I don't want to store the field values in the index, I can't use Field.Keyword, though. I just realized that I'm storing those 3 fields when I don't need to. The only field I need to store is the primary key of the person in the database (not pictured in the above code), which I use to retrieve the full record from the database later.

>> Then in my search code I create the first name query as a WildcardQuery if the first name is provided (adding a * to the end if it's not already there):
>>
>>     Term fnameTerm = null;
>>     Query fnameQuery = null;
>>     if (fnameIn.length() > 0) {
>>         if (!fnameIn.endsWith("*")) {
>>             fnameIn += "*";
>>         }
>>         fnameTerm = new Term("fname", fnameIn);
>>         fnameQuery = new WildcardQuery(fnameTerm);
>>     }
>
> I recommend PrefixQuery in this case.

Excellent. That actually works much better than the WildcardQuery for what I'm trying to do here.

> I presume you lowercased fnameIn? You should, to get it to match what was indexed.

Yes, I did.

>> Does this seem like a reasonable way to approach this problem, or am I missing something that's going to bite me in the you-know-what?
>
> Seems reasonable to me as long as you are lowercasing the strings at query time also.
>
> Erik

Thanks for your response. I really appreciate it.

Jason