Re: Question on how to build a query

2004-06-19 Thread Jason St. Louis
Well, I seem to have gotten something to work.  Maybe someone could just 
 comment on my approach.

I wrote my indexer so that it added each field without tokenizing it:
Field fnameField = new Field("fname", fname.toLowerCase(), true, true, false);
Field lnameField = new Field("lname", lname.toLowerCase(), true, true, false);
Field cityField = new Field("city", position.toLowerCase(), true, true, false);

By the way, if this is the case, is the indexer even using the analyzer 
that I pass to it?

Then in my search code I create the firstname query as a WildcardQuery 
if the first name is provided (adding a * to the end if it's not already 
there):

Term fnameTerm = null;
Query fnameQuery = null;
if( fnameIn.length() > 0 )
{
    if( !fnameIn.endsWith("*") )
    {
        fnameIn += "*";
    }
    fnameTerm = new Term("fname", fnameIn);
    fnameQuery = new WildcardQuery(fnameTerm);
}

I then create my lastname query as either a WildcardQuery or a term 
query depending on whether it contains a *:

Term lnameTerm = new Term("lname", lnameIn);
Query lnameQuery = null;
if( lnameIn.indexOf("*") != -1 )
{
    lnameQuery = new WildcardQuery(lnameTerm);
}
else
{
    lnameQuery = new TermQuery(lnameTerm);
}

Lastly, I create the city query as a TermQuery.
Finally, I add the 3 queries to a booleanQuery, not adding the first 
name query if it is null (this means a first name was not provided) and 
making lastname and city required:

if(fnameQuery != null)
{
    overallQuery.add(fnameQuery, true, false);
}
overallQuery.add(lnameQuery, true, false);
overallQuery.add(positionQuery, true, false);

I then search my index and it appears to work. I haven't tested it 
extensively yet, though.

Does this seem like a reasonable way to approach this problem, or am I 
missing something that's going to bite me in the you-know-what?

Thanks.
Jason
Jason St. Louis wrote:
Hi everyone.  I'm wondering if someone could help me out.
I have created an index of a database of person records where I have 
created documents with the following fields:
database primary_key (stored, not indexed)
first name (indexed)
last name (indexed)
city (indexed)

I used SimpleAnalyzer when creating the index.
I am providing a web based form to search this index.  The form has 3 
fields for first name, last name and city (city is a drop down list).

I want to take the users input and from these 3 fields and build a query 
such that:
A)last name is mandatory and can be wildcarded (I will probably make 
sure the value begins with at least one letter)
B)First name can be wildcarded (same as last name, although if it is 
left blank, I would probably just search the last_name and city and 
ignore the first name)
C)city is mandatory and must match exactly

How would I go about building this query?
Do I create a wildcard query for first name and last name, a term query 
for city and then combine them into boolean query where all 3 terms must 
be matched?  I kind of feel like I'm grasping at straws here.  I think I 
just need a jumpstart to understand how the Query API works.

Thanks.
Jason

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



amusing interaction between advanced tokenizers and highlighter package

2004-06-19 Thread David Spencer
I've run across an amusing interaction between advanced 
Analyzers/TokenStreams and the very useful term highlighter: 
http://cvs.apache.org/viewcvs/jakarta-lucene-sandbox/contributions/highlighter/

I have a custom Analyzer I'm using to index javadoc-generated web pages.
The Analyzer in turn has a custom TokenStream which tries to more 
intelligently tokenize java-language tokens.

A naive analyzer would turn something like SyncThreadPool into one 
token. Mine uses the great Lucene capability of Tokens being able to 
have a 0 position increment to turn it into the token stream:

Sync   (incr = 0)
Thread (incr = 0)
Pool (incr = 0)
SyncThreadPool (incr = 1)
[As an aside maybe it should also pair up the subtokens, so SyncThread 
and ThreadPool appear too].
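
A minimal sketch of the splitting idea in plain Java, independent of Lucene. The class IdentifierSplitter, its Tok type, and the regular expression are illustrative assumptions, not the analyzer described in this message. Because an increment of 0 places a token at the same position as the token emitted just before it, the sketch emits the whole identifier first (increment 1) and stacks the parts on it (increment 0), which is a different order from the listing above.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: splits Java-style identifiers into stacked tokens. */
final class IdentifierSplitter {

    /** A term plus its position increment (hypothetical stand-in for a Lucene Token). */
    static final class Tok {
        final String term;
        final int incr;
        Tok(String term, int incr) { this.term = term; this.incr = incr; }
        @Override public String toString() { return term + "(" + incr + ")"; }
    }

    /** Splits at lower-to-upper case boundaries and at underscores. */
    static List<String> parts(String identifier) {
        List<String> out = new ArrayList<>();
        for (String p : identifier.split("(?<=[a-z0-9])(?=[A-Z])|_")) {
            if (!p.isEmpty()) out.add(p);
        }
        return out;
    }

    /** Whole identifier first (incr 1), then its parts on the same position (incr 0). */
    static List<Tok> tokens(String identifier) {
        List<Tok> out = new ArrayList<>();
        out.add(new Tok(identifier, 1));
        List<String> parts = parts(identifier);
        if (parts.size() > 1) {   // a single part would only repeat the identifier
            for (String p : parts) out.add(new Tok(p, 0));
        }
        return out;
    }
}
```

Calling IdentifierSplitter.tokens("SyncThreadPool") yields the full identifier followed by Sync, Thread and Pool, all sharing one position.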

The point behind this is that someone searching for threadpool probably
would want to see a match for SyncThreadPool even though this is the evil
leading-prefix case. With most other Analyzers and ways of forming a
query this would be missed, which I think is anti-human and annoys me to
no end.

So the analyzer/tokenizer works great, and I have a demo site about to 
come up that indexes lots of publicly avail javadoc as a kind of 
resource so you can easily find what's already been done.

The problem is as follows. In all cases I use my Analyzer to index the 
documents.
If I use my Analyzer with the Highlighter package, it doesn't look
at the position increment of the tokens and consequently a nonsense
stream of matches is output. If I use a different Analyzer w/ the
highlighter (say, the StandardAnalyzer), then it doesn't show the
matches that really matched, as it doesn't see the subtokens.

It might be the fix is for the Highlighter to look at the position 
increment of tokens and only pass by one if multiple ones have an incr 
of 0 and match one part of the query.
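
One way to read that suggestion, sketched in plain Java rather than against the Highlighter's real classes: track the absolute position as the running sum of increments, and count a position once no matter how many stacked tokens match there. PositionMatches and its parameters are hypothetical names, and the sketch assumes the first token of each stack carries the increment of 1.

```java
import java.util.LinkedHashSet;
import java.util.Set;

final class PositionMatches {
    /**
     * Returns the distinct token positions at which at least one term is in queryTerms.
     * terms[i] and incrs[i] describe the i-th token of the stream.
     */
    static Set<Integer> matchedPositions(String[] terms, int[] incrs, Set<String> queryTerms) {
        Set<Integer> matched = new LinkedHashSet<>();
        int position = -1;                 // so a first increment of 1 yields position 0
        for (int i = 0; i < terms.length; i++) {
            position += incrs[i];          // an increment of 0 stays on the same position
            if (queryTerms.contains(terms[i])) {
                matched.add(position);     // a Set, so stacked matches count once
            }
        }
        return matched;
    }
}
```

With the stacked tokens for SyncThreadPool and a query containing both thread and pool, the result holds a single position, so the corresponding text would be marked up once rather than twice.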

Has this come up before and is the issue clear?
thx,
Dave


Re:amusing interaction between advanced tokenizers and highlighter package

2004-06-19 Thread markharw00d
Yes, this issue has come up before with other choices of analyzers.
I think it should be fixable without changing any of the highlighter APIs 
- can you email me or post here the source to your analyzer?

Cheers
Mark




Re: Question on how to build a query

2004-06-19 Thread Erik Hatcher
On Jun 19, 2004, at 1:51 AM, Jason St. Louis wrote:
I wrote my indexer so that it added each field without tokenizing it:
Field fnameField = new Field("fname", fname.toLowerCase(), true, true, false);
Field lnameField = new Field("lname", lname.toLowerCase(), true, true, false);
Field cityField = new Field("city", position.toLowerCase(), true, true, false);

By the way, if this is the case, is the indexer even using the 
analyzer that I pass to it?
No.  Tokenized fields are analyzed.  Non-tokenized fields are left 
as-is.  It might be clearer if you used Field.Keyword instead, which is 
identical to what you have here.
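
To make the difference concrete, here is a small stand-alone sketch that does not use Lucene. letterTokens imitates what SimpleAnalyzer does to a tokenized field (split at non-letters, lowercase), while keywordTerm imitates the single lowercased term produced by the untokenized fields above. Both method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class AnalysisSketch {
    /** Roughly what a tokenized field ends up with: runs of letters, lowercased. */
    static List<String> letterTokens(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char ch : text.toCharArray()) {
            if (Character.isLetter(ch)) {
                current.append(Character.toLowerCase(ch));
            } else if (current.length() > 0) {
                tokens.add(current.toString());   // a non-letter ends the current token
                current.setLength(0);
            }
        }
        if (current.length() > 0) tokens.add(current.toString());
        return tokens;
    }

    /** What the untokenized fields index here: the whole lowercased value as one term. */
    static String keywordTerm(String text) {
        return text.toLowerCase(Locale.ROOT);
    }
}
```

For a city such as St. Louis the tokenized form gives the two terms st and louis, while the untokenized form gives the single term "st. louis", so an exact-match TermQuery has to use the whole lowercased value.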

Then in my search code I create the firstname query as a WildcardQuery 
if the first name is provided (adding a * to the end if it's not 
already there):

Term fnameTerm = null;
Query fnameQuery = null;
if( fnameIn.length() > 0 )
{
    if( !fnameIn.endsWith("*") )
    {
        fnameIn += "*";
    }
    fnameTerm = new Term("fname", fnameIn);
    fnameQuery = new WildcardQuery(fnameTerm);
}
I recommend PrefixQuery in this case.
I presume you lowercased fnameIn?  You should to get it to match what 
was indexed.

Does this seem like a reasonable way to approach this problem, or am I 
missing something that's going to bite me in the you-know-what?
Seems reasonable to me as long as you are lowercasing the strings at 
query time also.
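
The normalization advice can be sketched without Lucene as a small helper that lowercases the input and decides which kind of query to build. NameInput, its Kind values, and the rule that a single trailing * means a prefix search are assumptions for illustration; the actual PrefixQuery, WildcardQuery or TermQuery would then be built from the result.

```java
import java.util.Locale;

final class NameInput {
    enum Kind { NONE, TERM, PREFIX, WILDCARD }

    final Kind kind;
    final String text;   // lowercased; the trailing '*' is removed for PREFIX

    private NameInput(Kind kind, String text) {
        this.kind = kind;
        this.text = text;
    }

    /** Normalizes raw form input the same way the index was built (lowercase). */
    static NameInput parse(String raw) {
        String s = raw == null ? "" : raw.trim().toLowerCase(Locale.ROOT);
        if (s.isEmpty() || s.equals("*")) {
            return new NameInput(Kind.NONE, "");        // nothing to constrain on
        }
        int star = s.indexOf('*');
        boolean hasQuestionMark = s.indexOf('?') != -1;
        if (star == -1 && !hasQuestionMark) {
            return new NameInput(Kind.TERM, s);         // exact match
        }
        if (star == s.length() - 1 && !hasQuestionMark) {
            // the only wildcard is one '*' at the very end: a plain prefix search
            return new NameInput(Kind.PREFIX, s.substring(0, star));
        }
        return new NameInput(Kind.WILDCARD, s);         // wildcard somewhere else
    }
}
```

Keeping this decision in one place also makes it hard to forget the lowercasing on any of the three form fields.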

Erik


Re: amusing interaction between advanced tokenizers and highlighter package

2004-06-19 Thread Erik Hatcher
On Jun 19, 2004, at 2:29 AM, David Spencer wrote:
A naive analyzer would turn something like SyncThreadPool into one 
token. Mine uses the great Lucene capability of Tokens being able to 
have a 0 position increment to turn it into the token stream:

Sync   (incr = 0)
Thread (incr = 0)
Pool (incr = 0)
SyncThreadPool (incr = 1)
[As an aside maybe it should also pair up the subtokens, so 
SyncThread and ThreadPool appear too].

The point behind this is that someone searching for threadpool probably
would want to see a match for SyncThreadPool even though this is the evil
leading-prefix case. With most other Analyzers and ways of forming a
query this would be missed, which I think is anti-human and annoys me
to no end.
There are indexing/querying solutions/workarounds to the leading-prefix 
issue, such as reversing the text as you index it and ensuring you do 
the same on queries so they match.  There are some interesting 
techniques for this type of thing in the Managing Gigabytes book I'm 
currently reading, which Lucene could support with custom analysis and 
queries, I believe.
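
The reversal trick can be shown with plain strings. If each term is also indexed reversed, a leading-wildcard query such as *pool turns into an ordinary prefix lookup on the reversed form. The class and method names below are hypothetical, and a real setup would do the reversal inside an analyzer or token filter.

```java
final class ReverseSketch {
    /** Reverses a term, as would be done at index time for the reversed field. */
    static String reverse(String term) {
        return new StringBuilder(term).reverse().toString();
    }

    /** Turns a leading-wildcard pattern like "*pool" into the prefix to look up: "loop". */
    static String reversedPrefix(String leadingWildcard) {
        if (!leadingWildcard.startsWith("*")) {
            throw new IllegalArgumentException("expected a pattern starting with '*'");
        }
        return reverse(leadingWildcard.substring(1));
    }

    /** True if the indexed (reversed) term matches the leading-wildcard pattern. */
    static boolean matches(String indexedReversedTerm, String leadingWildcard) {
        return indexedReversedTerm.startsWith(reversedPrefix(leadingWildcard));
    }
}
```

The cost is a second copy of every term in the index, in exchange for turning an expensive scan into a prefix search.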

The problem is as follows. In all cases I use my Analyzer to index the 
documents.
If I use my Analyzer with the Highlighter package, it doesn't
look at the position increment of the tokens and consequently a
nonsense stream of matches is output. If I use a different Analyzer w/
the highlighter (say, the StandardAnalyzer), then it doesn't show the
matches that really matched, as it doesn't see the subtokens.
Are your subtokens marked with correct offset values?  This probably 
doesn't relate to the problem you're seeing, but I'm curious.

It might be the fix is for the Highlighter to look at the position 
increment of tokens and only pass by one if multiple ones have an incr 
of 0 and match one part of the query.

Has this come up before and is the issue clear?
The problem is clear, and I've identified this issue with my 
exploration of the Highlighter also.  The Highlighter works well for 
the most common scenarios, but certainly doesn't cover all the bases.  
The majority of scenarios do not use multiple tokens in a single 
position.  It also doesn't currently handle the new SpanQuery
family - although Highlighting spans would be quite cool.  After 
learning how Highlighter works, I have a deep appreciation for the 
great work Mark put into it - it is well done.

As for this issue, though, I think your solution sounds reasonable, 
although I haven't thought it through completely.  Perhaps Mark can 
comment.  If you do modify it to work for your case, it would be great 
to have your contribution rolled back in :)

Erik


stop words in index

2004-06-19 Thread lucene
Hi!

How come stop words show up in the index (HighFreqTerms)? Yes, I do use the
same analyzer for indexing and searching.

class SearchFacade
{
    private final static String[] GERMAN_STOP_WORDS = new String[] { "foo", "bar" };
    private final static Analyzer GERMAN_ANALYZER =
        new SnowballAnalyzer( "German2", GERMAN_STOP_WORDS );

    public void index()
    {
        writer = new IndexWriter( Configuration.Lucene.INDEX, GERMAN_ANALYZER, true );
        ...
    }

    public void search(String query)
    {
        final Query q = MultiFieldQueryParser.parse( query,
            new String[] { "blah", "foo", "bar" }, GERMAN_ANALYZER );
        ...
    }
}




Re: amusing interaction between advanced tokenizers and highlighter package

2004-06-19 Thread David Spencer
[EMAIL PROTECTED] wrote:
Yes, this issue has come up before with other choices of analyzers.
I think it should be fixable without changing any of the highlighter APIs 
- can you email me or post here the source to your analyzer?
 

Code attached - don't make fun of it please :) - very prelim. I think it 
only uses one other file, (TRQueue) also attached (but: note, it's in a 
different package). Also any comments in the code may be inaccurate. The 
general goal is as stated in my earlier mail, examples are:

AlphaBeta ->
    Alpha (incr 0)
    Beta (incr 0)
    AlphaBeta (incr 1)
MAX_INT ->
    MAX (incr 0)
    INT (incr 0)
    MAX_INT (incr 1)

thx,
Dave
Cheers
Mark
 


package com.tropo.lucene;

import org.apache.lucene.analysis.*;
import java.io.*;
import java.util.*;
import com.tropo.util.*;
import java.util.regex.*;

/**
 * Try to parse javadoc better than other analyzers.
 */
public final class JavadocAnalyzer
    extends Analyzer
{

    // [A-Za-z0-9._]+
    //
    public final TokenStream tokenStream( String fieldName, Reader reader)
    {
        return new LowerCaseFilter( new JStream( fieldName, reader));
    }

    /**
     * Try to break up a token into subset/subtokens that might be said to
     * occur in the same place.
     */
    public static List breakup( String s)
    {
        // a -> null
        // alphaBeta -> alpha, Beta
        // XXAlpha -> ?, Alpha
        // BIG_NUM -> BIG, NUM

        List lis = new LinkedList();

        Matcher m;

        m = breakupPattern.matcher( s);
        while (m.find())
        {
            String g = m.group();
            if ( ! g.equals( s))
                lis.add( g);
        }

        // hard ones
        m = breakupPattern2.matcher( s);
        while (m.find())
        {
            String g;
            if ( m.groupCount() == 2) // weird XXFoo case
                g = m.group( 2);
            else
                g = m.group();
            if ( ! g.equals( s))
                lis.add( g);
            /*
            o.println( "gc: " + m.groupCount() +
                       "/" + m.group( 0) + "/" + m.group( 1) + "/" + m.group( 2));
            */
            //lis.add( m.group());
        }
        return lis;
    }


    /**
     *
     */
    private static class JStream
        extends TokenStream
    {
        private TRQueue q = new TRQueue();
        private Set already = new HashSet();
        private String fieldName;
        private PushbackReader pb;

        private StringBuffer sb = new StringBuffer( 32);
        private int offset;

        // eat white
        // have
        private int state = 0;


        /**
         *
         */
        private JStream( String fieldName, Reader reader)
        {
            this.fieldName = fieldName;
            pb = new PushbackReader( reader);
        }


        /**
         *
         */
        public Token next()
            throws IOException
        {
            if ( q.size() > 0) // pre-calculated
                return (Token) q.dequeue();
            int c;
            int start = offset;
            sb.setLength( 0);
            offset--;
            boolean done = false;
            String type = "mystery";
            state = 0;

            while ( ! done &&
                    ( c = pb.read()) != -1)
            {
                char ch = (char) c;
                offset++;
                switch( state)
                {
                case 0:
                    if ( Character.isJavaIdentifierStart( ch))
                    {
                        start = offset;
                        sb.append( ch);
                        state = 1;
                        type = "id";
                    }
                    else if ( Character.isDigit( ch))


Re: amusing interaction between advanced tokenizers and highlighter package

2004-06-19 Thread David Spencer
Erik Hatcher wrote:
On Jun 19, 2004, at 2:29 AM, David Spencer wrote:
A naive analyzer would turn something like SyncThreadPool into one 
token. Mine uses the great Lucene capability of Tokens being able to 
have a 0 position increment to turn it into the token stream:

Sync   (incr = 0)
Thread (incr = 0)
Pool (incr = 0)
SyncThreadPool (incr = 1)
[As an aside maybe it should also pair up the subtokens, so 
SyncThread and ThreadPool appear too].

The point behind this is that someone searching for threadpool probably
would want to see a match for SyncThreadPool even though this is the evil
leading-prefix case. With most other Analyzers and ways of forming a
query this would be missed, which I think is anti-human and annoys me
to no end.

There are indexing/querying solutions/workarounds to the 
leading-prefix issue, such as reversing the text as you index it and 
ensuring you do the same on queries so they match.  There are some 
interesting techniques for this type of thing in the Managing 
Gigabytes book I'm currently reading, which Lucene could support with 
custom analysis and queries, I believe.
Yeah, great book. I thought my approach fit into Lucene the most
naturally for my goals - and no doubt, things like just having the
possibility of different pos increments is a great concept that I
haven't seen in other search engines. I keep meaning to try an idea that
appeared on the list some months ago, bumping up the incr between
sentences so that it's harder for, say, a 2 word phrase to match w/ 1
word in each sentence (makes sense to a computer, but usually not what a
human wants).  Another side project...
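
The sentence-gap idea can be sketched in isolation: give the first token after a sentence end a larger increment, so that tokens on either side of the boundary are no longer adjacent and an exact phrase cannot span it. The gap size, the end-of-sentence test (a trailing period), and all names here are assumptions for illustration.

```java
final class SentenceGapSketch {
    /** Absolute positions for tokens, adding extraGap after each sentence end. */
    static int[] positions(String[] tokens, int extraGap) {
        int[] pos = new int[tokens.length];
        int position = -1;                        // so the first token lands on 0
        boolean afterSentenceEnd = false;
        for (int i = 0; i < tokens.length; i++) {
            int incr = afterSentenceEnd ? 1 + extraGap : 1;
            position += incr;
            pos[i] = position;
            afterSentenceEnd = tokens[i].endsWith(".");
        }
        return pos;
    }

    /** An exact two-word phrase needs consecutive positions. */
    static boolean adjacent(int first, int second) {
        return second - first == 1;
    }
}
```

With a gap of 10, the last word of one sentence and the first word of the next end up 11 positions apart, so only a phrase query with a large enough slop could still bridge them.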


The problem is as follows. In all cases I use my Analyzer to index 
the documents.
If I use my Analyzer with the Highlighter package, it doesn't
look at the position increment of the tokens and consequently a
nonsense stream of matches is output. If I use a different Analyzer
w/ the highlighter (say, the StandardAnalyzer), then it doesn't show
the matches that really matched, as it doesn't see the subtokens.

Are your subtokens marked with correct offset values?  This probably 
doesn't relate to the problem you're seeing, but I'm curious.

I think so but this is the first time I've done this kind of thing. When 
I hit the special case several of the subtokens are 1st returned w/ an 
incr of 0, then the normal token, w/ an incr of 1 - which seems to make 
sense to me at least.


It might be the fix is for the Highlighter to look at the position 
increment of tokens and only pass by one if multiple ones have an 
incr of 0 and match one part of the query.

Has this come up before and is the issue clear?

The problem is clear, and I've identified this issue with my 
exploration of the Highlighter also.  The Highlighter works well for 
the most common scenarios, but certainly doesn't cover all the bases.  
The majority of scenarios do not use multiple tokens in a single 
position.  It also doesn't currently handle the new SpanQuery
family - although Highlighting spans would be quite cool.  After 
learning how Highlighter works, I have a deep appreciation for the 
great work Mark put into it - it is well done.

As for this issue, though, I think your solution sounds reasonable, 
although I haven't thought it through completely.  Perhaps Mark can 
comment.  If you do modify it to work for your case, it would be great
to have your contribution rolled back in :)
Oh sure, I'll post any changes but wait for Mark for now.
Erik


Re: amusing interaction between advanced tokenizers and highlighter

2004-06-19 Thread markharw00d
A question before I dive into coding a fix: can I assume (for all analyzers) that the
tokens produced by the tokenStream have the following property:
   currentToken.startOffset() >= lastToken.startOffset()

The analyzers I have tested the highlighter with so far have the property:
   currentToken.startOffset() > lastToken.endOffset()
so their tokens don't overlap, but I understand this isn't the case for others (any
demonstrable examples of such problem analyzers would be appreciated for testing
purposes).

If I can assume that token streams always produce a zero or greater increment in
token.startOffset, I think I can design a solution that still works using a single
pass of the token stream.

I suspect an additional flushText method will be required on the Formatter interface
to allow implementations to use a buffer. This buffer would be required to accumulate
overlapping token scores when trying to decide if a section of the original text
merited any highlight markup.
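
A rough sketch of that single-pass idea, assuming start offsets never decrease: keep one buffered region, fold each overlapping token into it, and flush the region when a token starts at or beyond its end. OverlapBuffer, Region, and summing the scores are assumptions made for this sketch; it is not the Highlighter's code or the proposed flushText signature.

```java
import java.util.ArrayList;
import java.util.List;

final class OverlapBuffer {
    /** A span of the original text, [start, end), with an accumulated score. */
    static final class Region {
        final int start;
        int end;
        float score;
        Region(int start, int end, float score) {
            this.start = start;
            this.end = end;
            this.score = score;
        }
    }

    /**
     * Groups tokens into non-overlapping regions in one pass.
     * Assumes starts[] is non-decreasing.
     */
    static List<Region> group(int[] starts, int[] ends, float[] scores) {
        List<Region> flushed = new ArrayList<>();
        Region buffer = null;
        for (int i = 0; i < starts.length; i++) {
            if (buffer != null && starts[i] < buffer.end) {
                // overlaps the buffered region: widen it and accumulate the score
                buffer.end = Math.max(buffer.end, ends[i]);
                buffer.score += scores[i];
            } else {
                if (buffer != null) flushed.add(buffer);   // region is complete
                buffer = new Region(starts[i], ends[i], scores[i]);
            }
        }
        if (buffer != null) flushed.add(buffer);           // flush the last region
        return flushed;
    }
}
```

A formatter working from these regions would emit markup once per region with a positive score, instead of once per token.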

Cheers
Mark





Re: Question on how to build a query

2004-06-19 Thread Jason St. Louis

Erik Hatcher wrote:
On Jun 19, 2004, at 1:51 AM, Jason St. Louis wrote:
I wrote my indexer so that it added each field without tokenizing it:
Field fnameField = new Field("fname", fname.toLowerCase(), true, true, false);
Field lnameField = new Field("lname", lname.toLowerCase(), true, true, false);
Field cityField = new Field("city", position.toLowerCase(), true, true, false);

By the way, if this is the case, is the indexer even using the 
analyzer that I pass to it?

No.  Tokenized fields are analyzed.  Non-tokenized fields are left 
as-is.  It might be clearer if you used Field.Keyword instead, which is 
identical to what you have here.
That's what I figured.  I suppose if I don't want to store the field 
values in the index, I can't use Field.Keyword, though. I just realized 
that I'm storing those 3 fields when I don't need to.  The only field I 
need to store is the primary key of the person in the database (not 
pictured in the above code) which I use to retrieve the full record from 
the database later.


Then in my search code I create the firstname query as a WildcardQuery 
if the first name is provided (adding a * to the end if it's not 
already there):

Term fnameTerm = null;
Query fnameQuery = null;
if( fnameIn.length() > 0 )
{
    if( !fnameIn.endsWith("*") )
    {
        fnameIn += "*";
    }
    fnameTerm = new Term("fname", fnameIn);
    fnameQuery = new WildcardQuery(fnameTerm);
}

I recommend PrefixQuery in this case.
Excellent.  That actually works much better than the WildcardQuery for 
what I'm trying to do here.

I presume you lowercased fnameIn?  You should to get it to match what 
was indexed.
Yes, I did.

Does this seem like a reasonable way to approach this problem, or am I 
missing something that's going to bite me in the you-know-what?

Seems reasonable to me as long as you are lowercasing the strings at 
query time also.

Erik

Thanks for your response.  I really appreciate it.
Jason