Re: amusing interaction between advanced tokenizers and highlighter package

2004-06-19 Thread David Spencer
Erik Hatcher wrote:
On Jun 19, 2004, at 2:29 AM, David Spencer wrote:
A naive analyzer would turn something like "SyncThreadPool" into one 
token. Mine uses the great Lucene capability of Tokens being able to 
have a "0" position increment to turn it into the token stream:

Sync   (incr = 0)
Thread (incr = 0)
Pool (incr = 0)
SyncThreadPool (incr = 1)
[As an aside maybe it should also pair up the subtokens, so 
"SyncThread" and "ThreadPool" appear too].

The point behind this is someone searching for "threadpool" probably 
would want to see a match for "SyncThreadPool" even though this is the evil
leading-prefix case. With most other Analyzers and ways of forming a 
query this would be missed, which I think is anti-human and annoys me 
to no end.

There are indexing/querying solutions/workarounds to the 
leading-prefix issue, such as reversing the text as you index it and 
ensuring you do the same on queries so they match.  There are some 
interesting techniques for this type of thing in the Managing 
Gigabytes book I'm currently reading, which Lucene could support with 
custom analysis and queries, I believe.
Yeah, great book. I thought my approach fit into Lucene the most 
naturally for my goals - and no doubt, things like just having the 
possibility of different pos increments is a great concept that I 
haven't seen in other search engines. I keep meaning to try an idea that 
appeared on the list some months ago, bumping up the incr between 
sentences so that it's harder for, say, a 2 word phrase to match w/ 1
word in each sentence (makes sense to a computer, but usually not what a 
human wants).  Another side project...


The problem is as follows. In all cases I use my Analyzer to index 
the documents.
If I use my Analyzer with the Highlighter package, it doesn't
look at the position increment of the tokens and consequently a 
nonsense stream of matches is output. If I use a different Analyzer 
w/ the highlighter (say, the StandardAnalyzer), then it doesn't show 
the matches that really matched, as it doesn't see the "subtokens".

Are your "subtokens" marked with correct offset values?  This probably 
doesn't relate to the problem you're seeing, but I'm curious.

I think so, but this is the first time I've done this kind of thing. When
I hit the special case, several of the "subtokens" are first returned w/
an incr of 0, then the normal token w/ an incr of 1 - which seems to make
sense to me at least.
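For what it's worth, subtoken offsets can be derived from the surface term's start offset by scanning for each subtoken inside the term. A minimal sketch (hypothetical helper, not from the attached analyzer):

```java
import java.util.List;

public class SubtokenOffsets {
    // Locate a subtoken inside its surface term and map it to document
    // offsets, given the term's start offset and a scan position.
    static int[] offsets(String surface, String sub, int termStart, int from) {
        int i = surface.indexOf(sub, from);
        return new int[] { termStart + i, termStart + i + sub.length() };
    }

    public static void main(String[] args) {
        // "SyncThreadPool" starting at document offset 10:
        int from = 0;
        for (String sub : List.of("Sync", "Thread", "Pool")) {
            int[] o = offsets("SyncThreadPool", sub, 10, from);
            from = o[1] - 10; // resume the scan past this subtoken
            System.out.println(sub + " -> [" + o[0] + "," + o[1] + ")");
        }
        // Sync -> [10,14)  Thread -> [14,20)  Pool -> [20,24)
    }
}
```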


It may be that the fix is for the Highlighter to look at the position
increment of tokens, and to advance past only one token when multiple
tokens have an incr of 0 and match one part of the query.
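That fix could look roughly like this: group a run of incr-0 subtokens with the adjacent incr-1 surface token into one position, and mark the surface form at most once if any member of the group matches a query term. A plain-Java sketch of one reading of the proposal, not Highlighter code (Tok again stands in for Lucene's Token):

```java
import java.util.*;

public class PositionGrouping {
    record Tok(String text, int incr) {}

    // A run of incr-0 subtokens followed by the incr-1 surface token forms one
    // group (this follows the subtokens-first ordering described in the thread).
    static List<List<Tok>> groups(List<Tok> stream) {
        List<List<Tok>> out = new ArrayList<>();
        List<Tok> cur = new ArrayList<>();
        for (Tok t : stream) {
            cur.add(t);
            if (t.incr() >= 1) { // the surface token closes the group
                out.add(cur);
                cur = new ArrayList<>();
            }
        }
        if (!cur.isEmpty())
            out.add(cur);
        return out;
    }

    // Highlight each group's surface form at most once, if any member matches.
    static String highlight(List<Tok> stream, Set<String> query) {
        StringBuilder sb = new StringBuilder();
        for (List<Tok> g : groups(stream)) {
            Tok surface = g.get(g.size() - 1);
            boolean hit = g.stream()
                .anyMatch(t -> query.contains(t.text().toLowerCase()));
            if (sb.length() > 0)
                sb.append(' ');
            sb.append(hit ? "<b>" + surface.text() + "</b>" : surface.text());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Tok> stream = List.of(
            new Tok("uses", 1),
            new Tok("Sync", 0), new Tok("Thread", 0), new Tok("Pool", 0),
            new Tok("SyncThreadPool", 1),
            new Tok("internally", 1));
        System.out.println(highlight(stream, Set.of("pool")));
        // uses <b>SyncThreadPool</b> internally
    }
}
```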

Has this come up before and is the issue clear?

The problem is clear, and I've identified this issue with my 
exploration of the Highlighter also.  The Highlighter works well for 
the most common scenarios, but certainly doesn't cover all the bases.  
The majority of scenarios do not use multiple tokens in a single 
position.  Also, it doesn't currently handle the new SpanQuery 
family - although Highlighting spans would be quite cool.  After 
learning how Highlighter works, I have a deep appreciation for the 
great work Mark put into it - it is well done.

As for this issue, though, I think your solution sounds reasonable, 
although I haven't thought it through completely.  Perhaps Mark can 
comment.  If you do modify it to work for your case, it would be great 
to have your contribution rolled back in :)
Erik

Oh sure, I'll post any changes but wait for Mark for now.
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: amusing interaction between advanced tokenizers and highlighter package

2004-06-19 Thread David Spencer
[EMAIL PROTECTED] wrote:
Yes, this issue has come up before with other choices of analyzers.
I think it should be fixable without changing any of the highlighter APIs 
- can you email me or post here the source to your analyzer?
 

Code attached - don't make fun of it please :) - very prelim. I think it 
only uses one other file (TRQueue), also attached (but note: it's in a 
different package). Also, any comments in the code may be inaccurate. The 
general goal is as stated in my earlier mail, examples are:

AlphaBeta ->
Alpha (incr 0)
Beta (incr 0)
AlphaBeta (incr 1)
MAX_INT ->
MAX (incr 0)
INT (incr 0)
MAX_INT (incr 1)
thx,
Dave
Cheers
Mark
 


package com.tropo.lucene;

import org.apache.lucene.analysis.*;
import java.io.*;
import java.util.*;
import java.util.regex.*;
import com.tropo.util.*;

/**
 * Try to parse javadoc better than other analyzers.
 */
public final class JavadocAnalyzer
    extends Analyzer
{
    // tokens are roughly [A-Za-z0-9._]+
    public final TokenStream tokenStream( String fieldName, Reader reader)
    {
        return new LowerCaseFilter( new JStream( fieldName, reader));
    }

    // NOTE: the original pattern definitions were lost to the archive's
    // truncation; these are reconstructions consistent with the examples
    // in breakup()'s comments.
    private static final Pattern breakupPattern =
        Pattern.compile( "[A-Z]?[a-z0-9]+|[A-Z0-9]+(?![a-z])");
    private static final Pattern breakupPattern2 = // the "XXFoo" case
        Pattern.compile( "([A-Z]+)([A-Z][a-z]+)");

    /**
     * Try to break up a token into subtokens that might be said to occur
     * in the same place.
     */
    public static List breakup( String s)
    {
        // "a" -> null
        // "alphaBeta" -> "alpha", "Beta"
        // "XXAlpha" -> ?, Alpha
        // BIG_NUM -> "BIG", "NUM"

        List lis = new LinkedList();
        Matcher m;

        m = breakupPattern.matcher( s);
        while ( m.find())
        {
            String g = m.group();
            if ( ! g.equals( s))
                lis.add( g);
        }

        // hard ones
        m = breakupPattern2.matcher( s);
        while ( m.find())
        {
            String g;
            if ( m.groupCount() == 2) // weird XXFoo case
                g = m.group( 2);
            else
                g = m.group();
            if ( ! g.equals( s))
                lis.add( g);
        }
        return lis;
    }

    private static class JStream
        extends TokenStream
    {
        private TRQueue q = new TRQueue();   // pre-calculated subtokens
        private Set already = new HashSet(); // dedupe subtokens per term
        private String fieldName;
        private PushbackReader pb;

        private StringBuffer sb = new StringBuffer( 32);
        private int offset;

        // state 0: eating whitespace; state 1: inside a token
        private int state = 0;

        private JStream( String fieldName, Reader reader)
        {
            this.fieldName = fieldName;
            pb = new PushbackReader( reader);
        }

        public Token next()
            throws IOException
        {
            if ( q.size() > 0) // pre-calculated
                return (Token) q.dequeue();
            int c;
            int start = offset;
            sb.setLength( 0);
            offset--;
            boolean done = false;
            String type = "mystery";
            state = 0;

            while ( ! done &&
                    ( c = pb.read()) != -1)
            {
                char ch = (char) c;
                offset++;
                switch( state)
                {
                case 0:
                    if ( Character.isJavaIdentifierStart( ch))
                    {
                        start = offset;
                        sb.append( ch);
                        state = 1;
                        type = "id";
                    }
                    else if ( Character.isDigit( ch))
                    {
                        // [The archive truncates the original source here; the
                        //  rest of this method is a reconstruction of the
                        //  behavior described in the thread, assuming a
                        //  TRQueue.enqueue() method.]
                        start = offset;
                        sb.append( ch);
                        state = 1;
                        type = "num";
                    }
                    break;
                case 1:
                    if ( Character.isJavaIdentifierPart( ch))
                        sb.append( ch);
                    else
                    {
                        pb.unread( c); // put the terminator back
                        offset--;
                        done = true;
                    }
                    break;
                }
            }

            if ( sb.length() == 0)
                return null; // end of stream

            String text = sb.toString();
            Token surface = new Token( text, start, start + text.length(), type);

            // queue the subtokens (incr 0) ahead of the surface token (incr 1)
            List subs = breakup( text);
            already.clear();
            int from = 0;
            for ( Iterator it = subs.iterator(); it.hasNext();)
            {
                String sub = (String) it.next();
                if ( ! already.add( sub)) // breakup() can yield duplicates
                    continue;
                int i = text.indexOf( sub, from);
                if ( i < 0)
                    i = text.indexOf( sub);
                Token t = new Token( sub, start + i, start + i + sub.length(), type);
                t.setPositionIncrement( 0);
                q.enqueue( t);
                from = i + sub.length();
            }
            q.enqueue( surface); // keeps the default incr of 1
            return (Token) q.dequeue();
        }
    }
}
Re: amusing interaction between advanced tokenizers and highlighter package

2004-06-19 Thread Erik Hatcher
On Jun 19, 2004, at 2:29 AM, David Spencer wrote:
A naive analyzer would turn something like "SyncThreadPool" into one 
token. Mine uses the great Lucene capability of Tokens being able to 
have a "0" position increment to turn it into the token stream:

Sync   (incr = 0)
Thread (incr = 0)
Pool (incr = 0)
SyncThreadPool (incr = 1)
[As an aside maybe it should also pair up the subtokens, so 
"SyncThread" and "ThreadPool" appear too].

The point behind this is someone searching for "threadpool" probably 
would want to see a match for "SyncThreadPool" even though this is the evil 
leading-prefix case. With most other Analyzers and ways of forming a 
query this would be missed, which I think is anti-human and annoys me 
to no end.
There are indexing/querying solutions/workarounds to the leading-prefix 
issue, such as reversing the text as you index it and ensuring you do 
the same on queries so they match.  There are some interesting 
techniques for this type of thing in the Managing Gigabytes book I'm 
currently reading, which Lucene could support with custom analysis and 
queries, I believe.

The problem is as follows. In all cases I use my Analyzer to index the 
documents.
If I use my Analyzer with the Highlighter package, it doesn't 
look at the position increment of the tokens and consequently a 
nonsense stream of matches is output. If I use a different Analyzer w/ 
the highlighter (say, the StandardAnalyzer), then it doesn't show the 
matches that really matched, as it doesn't see the "subtokens".
Are your "subtokens" marked with correct offset values?  This probably 
doesn't relate to the problem you're seeing, but I'm curious.

It might be the fix is for the Highlighter to look at the position 
increment of tokens and only pass by one if multiple ones have an incr 
of 0 and match one part of the query.

Has this come up before and is the issue clear?
The problem is clear, and I've identified this issue with my 
exploration of the Highlighter also.  The Highlighter works well for 
the most common scenarios, but certainly doesn't cover all the bases.  
The majority of scenarios do not use multiple tokens in a single 
position.  Also, it doesn't currently handle the new SpanQuery 
family - although Highlighting spans would be quite cool.  After 
learning how Highlighter works, I have a deep appreciation for the 
great work Mark put into it - it is well done.

As for this issue, though, I think your solution sounds reasonable, 
although I haven't thought it through completely.  Perhaps Mark can 
comment.  If you do modify it to work for your case, it would be great 
to have your contribution rolled back in :)

Erik


amusing interaction between advanced tokenizers and highlighter package

2004-06-18 Thread David Spencer
I've run across an amusing interaction between advanced 
Analyzers/TokenStreams and the very useful "term highlighter": 
http://cvs.apache.org/viewcvs/jakarta-lucene-sandbox/contributions/highlighter/

I have a custom Analyzer I'm using to index javadoc-generated web pages.
The Analyzer in turn has a custom TokenStream which tries to more 
intelligently tokenize java-language tokens.

A naive analyzer would turn something like "SyncThreadPool" into one 
token. Mine uses the great Lucene capability of Tokens being able to 
have a "0" position increment to turn it into the token stream:

Sync   (incr = 0)
Thread (incr = 0)
Pool (incr = 0)
SyncThreadPool (incr = 1)
[As an aside maybe it should also pair up the subtokens, so "SyncThread" 
and "ThreadPool" appear too].

The point behind this is someone searching for "threadpool" probably 
would want to see a match for "SyncThreadPool" even though this is the evil 
leading-prefix case. With most other Analyzers and ways of forming a 
query this would be missed, which I think is anti-human and annoys me to 
no end.

So the analyzer/tokenizer works great, and I have a demo site about to 
come up that indexes lots of publicly avail javadoc as a kind of 
resource so you can easily find what's already been done.

The problem is as follows. In all cases I use my Analyzer to index the 
documents.
If I use my Analyzer with the Highlighter package, it doesn't look 
at the position increment of the tokens and consequently a nonsense 
stream of matches is output. If I use a different Analyzer w/ the 
highlighter (say, the StandardAnalyzer), then it doesn't show the 
matches that really matched, as it doesn't see the "subtokens".

It might be the fix is for the Highlighter to look at the position 
increment of tokens and only pass by one if multiple ones have an incr 
of 0 and match one part of the query.

Has this come up before and is the issue clear?
thx,
Dave