Re: amusing interaction between advanced tokenizers and highlighter package
Erik Hatcher wrote: On Jun 19, 2004, at 2:29 AM, David Spencer wrote: A naive analyzer would turn something like "SyncThreadPool" into one token. Mine uses the great Lucene capability of Tokens being able to have a "0" position increment to turn it into the token stream: Sync (incr = 0) Thread (incr = 0) Pool (incr = 0) SyncThreadPool (incr = 1) [As an aside maybe it should also pair up the subtokens, so "SyncThread" and "ThreadPool" appear too]. The point behind this is someone searching for "threadpool" probably would want to see a match for "SyncThreadPool" even this is the evil leading-prefix case. With most other Analyzers and ways of forming a query this would be missed, which I think is anti-human and annoys me to no end. There are indexing/querying solutions/workarounds to the leading-prefix issue, such as reversing the text as you index it and ensuring you do the same on queries so they match. There are some interesting techniques for this type of thing in the Managing Gigabytes book I'm currently reading, which Lucene could support with custom analysis and queries, I believe. Yeah, great book. I thought my approach fit into Lucene the most naturally for my goals - and no doubt, things like just having the possibility of different pos increments is a great concept that I haven't seen in other search engines. I keep meaning to try an idea that appeared on the list some months ago, bumping up the incr between sentences so that it's harders for, say, a 2 word phrase to match w/ 1 word in each sentence (makes sense to a computer, but usually not what a human wants). Another side project... The problem is as follows. In all cases I use my Analyzer to index the documents. If I use my Analyzer with with the Highligher package, it doesn't look at the position increment of the tokens and consequently a nonsense stream of matches is output. If I use a different Analyzer w/ the highlighter (say, the StandardAnalyzer), then it doesn't show the matches that really matched, as it doesn't see the "subtokens". Are your "subtokens" marked with correct offset values? This probably doesn't relate to the problem you're seeing, but I'm curious. I think so but this is the first time I've done this kind of thing. When I hit the special case several of the "subtokens" are 1st returned w/ an incr of 0, then the normal token, w/ an incr of 1 - which seems to make sense to me at least. It might be the fix is for the Highlighter to look at the position increment of tokens and only pass by one if multiple ones have an incr of 0 and match one part of the query. Has this come up before and is the issue clear? The problem is clear, and I've identified this issue with my exploration of the Highlighter also. The Highlighter works well for the most common scenarios, but certainly doesn't cover all the bases. The majority of scenarios do not use multiple tokens in a single position. Also, it also doesn't currently handle the new SpanQuery family - although Highlighting spans would be quite cool. After learning how Highlighter works, I have a deep appreciation for the great work Mark put into it - it is well done. As for this issue, though, I think your solution sounds reasonable, although I haven't thought it through completely. Perhaps Mark can comment. If you do modify it to work for your case, it Oh sure, I'll post any changes but wait for Mark for now. would be great to have your contribution rolled back in :) Erik - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: amusing interaction between advanced tokenizers and highlighter package
[EMAIL PROTECTED] wrote: Yes, this issue has come up before with other choices of analyzers. I think it should be fixable without changing any of the highlighter APIs - can you email me or post here the source to your analyzer? Code attached - don't make fun of it please :) - very prelim. I think it only uses one other file, (TRQueue) also attached (but: note, it's in a different package). Also any comments in the code may be inaccurate. The general goal is as stated in my earlier mail, examples are: AlphaBeta -> Alpha (incr 0) Beta (incr 0) AlphaBeta (incr 1) MAX_INT -> MAX (incr 0) INT (incr 0) MAX_INT (incr 1) thx, Dave Cheers Mark - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] package com.tropo.lucene; import org.apache.lucene.analysis.*; import java.io.*; import java.util.*; import com.tropo.util.*; import java.util.regex.*; /** * Try to parse javadoc better than othe analyzers. */ public final class JavadocAnalyzer extends Analyzer { // [A-Za-z0-9._]+ // public final TokenStream tokenStream( String fieldName, Reader reader) { return new LowerCaseFilter( new JStream( fieldName, reader)); } /** * Try to break up a token into subset/subtokens that might be said to occur in the same place. */ public static List breakup( String s) { // "a" -> null // "alphaBeta" -> "alpha", "Beta" // "XXAlpha" -> ?, Alpha // BIG_NUM -> "BIG", "NUM" List lis = new LinkedList(); Matcher m; m = breakupPattern.matcher( s); while (m.find()) { String g = m.group(); if ( ! g.equals( s)) lis.add( g); } // hard ones m = breakupPattern2.matcher( s); while (m.find()) { String g; if ( m.groupCount() == 2) // wierd XXFoo case g = m.group( 2); else g = m.group(); if ( ! g.equals( s)) lis.add( g); /* o.println( "gc: " + m.groupCount() + "/" + m.group( 0) + "/" + m.group( 1) + "/" + m.group( 2)); */ //lis.add( m.group()); } return lis; } /** * */ private static class JStream extends TokenStream { private TRQueue q = new TRQueue(); private Set already = new HashSet(); private String fieldName; private PushbackReader pb; private StringBuffer sb = new StringBuffer( 32); private int offset; // eat white // have private int state = 0; /** * */ private JStream( String fieldName, Reader reader) { this.fieldName = fieldName; pb = new PushbackReader( reader); } /** * */ public Token next() throws IOException { if ( q.size() > 0) // pre-calculated return (Token) q.dequeue(); int c; int start = offset; sb.setLength( 0); offset--; boolean done = false; String type = "mystery"; state = 0; while ( ! done && ( c = pb.read()) != -1) { char ch = (char) c; offset++; switch( state) { case 0: if ( Character.isJavaIdentifierStart( ch)) { start = offset; sb.append( ch); state = 1; type = "id"; } else if ( Character.isDigit( ch))
Re: amusing interaction between advanced tokenizers and highlighter package
On Jun 19, 2004, at 2:29 AM, David Spencer wrote: A naive analyzer would turn something like "SyncThreadPool" into one token. Mine uses the great Lucene capability of Tokens being able to have a "0" position increment to turn it into the token stream: Sync (incr = 0) Thread (incr = 0) Pool (incr = 0) SyncThreadPool (incr = 1) [As an aside maybe it should also pair up the subtokens, so "SyncThread" and "ThreadPool" appear too]. The point behind this is someone searching for "threadpool" probably would want to see a match for "SyncThreadPool" even this is the evil leading-prefix case. With most other Analyzers and ways of forming a query this would be missed, which I think is anti-human and annoys me to no end. There are indexing/querying solutions/workarounds to the leading-prefix issue, such as reversing the text as you index it and ensuring you do the same on queries so they match. There are some interesting techniques for this type of thing in the Managing Gigabytes book I'm currently reading, which Lucene could support with custom analysis and queries, I believe. The problem is as follows. In all cases I use my Analyzer to index the documents. If I use my Analyzer with with the Highligher package, it doesn't look at the position increment of the tokens and consequently a nonsense stream of matches is output. If I use a different Analyzer w/ the highlighter (say, the StandardAnalyzer), then it doesn't show the matches that really matched, as it doesn't see the "subtokens". Are your "subtokens" marked with correct offset values? This probably doesn't relate to the problem you're seeing, but I'm curious. It might be the fix is for the Highlighter to look at the position increment of tokens and only pass by one if multiple ones have an incr of 0 and match one part of the query. Has this come up before and is the issue clear? The problem is clear, and I've identified this issue with my exploration of the Highlighter also. The Highlighter works well for the most common scenarios, but certainly doesn't cover all the bases. The majority of scenarios do not use multiple tokens in a single position. Also, it also doesn't currently handle the new SpanQuery family - although Highlighting spans would be quite cool. After learning how Highlighter works, I have a deep appreciation for the great work Mark put into it - it is well done. As for this issue, though, I think your solution sounds reasonable, although I haven't thought it through completely. Perhaps Mark can comment. If you do modify it to work for your case, it would be great to have your contribution rolled back in :) Erik - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
amusing interaction between advanced tokenizers and highlighter package
I've run across an amusing interaction between advanced Analyzers/TokenStreams and the very useful "term highlighter": http://cvs.apache.org/viewcvs/jakarta-lucene-sandbox/contributions/highlighter/ I have a custom Analyzer I'm using to index javadoc-generated web pages. The Analyzer in turn has a custom TokenStream which tries to more intelligently tokenize java-language tokens. A naive analyzer would turn something like "SyncThreadPool" into one token. Mine uses the great Lucene capability of Tokens being able to have a "0" position increment to turn it into the token stream: Sync (incr = 0) Thread (incr = 0) Pool (incr = 0) SyncThreadPool (incr = 1) [As an aside maybe it should also pair up the subtokens, so "SyncThread" and "ThreadPool" appear too]. The point behind this is someone searching for "threadpool" probably would want to see a match for "SyncThreadPool" even this is the evil leading-prefix case. With most other Analyzers and ways of forming a query this would be missed, which I think is anti-human and annoys me to no end. So the analyzer/tokenizer works great, and I have a demo site about to come up that indexes lots of publicly avail javadoc as a kind of resource so you can easily find what's already been done. The problem is as follows. In all cases I use my Analyzer to index the documents. If I use my Analyzer with with the Highligher package, it doesn't look at the position increment of the tokens and consequently a nonsense stream of matches is output. If I use a different Analyzer w/ the highlighter (say, the StandardAnalyzer), then it doesn't show the matches that really matched, as it doesn't see the "subtokens". It might be the fix is for the Highlighter to look at the position increment of tokens and only pass by one if multiple ones have an incr of 0 and match one part of the query. Has this come up before and is the issue clear? thx, Dave - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]