RE: Disabling modifiers?
> Treating them as two separate words when quoted is indicative of your analyzer not being sufficient for your domain. What Analyzer are you using? Do you have knowledge of what it is tokenizing text into?

I have created a custom analyzer (CobolAnalyzer) which contains some custom stop words for the language, but it's using the StandardTokenizer and StandardFilter. I'll have a look and see if I can see what it's actually tokenizing the text into...

> Any ideas, or am I going to have to try and write my own query parser?

Well, if I manage to get something working, I'll let you know :-)

Thanks,
Iain

* * Micro Focus Developer Forum 2004 * * 3 days that will make a difference * * www.microfocus.com/devforum * *

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Disabling modifiers?
Thanks Gregor, I'll give it a try...

Iain

-----Original Message-----
From: Gregor Heinrich [mailto:[EMAIL PROTECTED]]
Sent: 15 December 2003 18:32
To: 'Lucene Users List'
Subject: RE: Disabling modifiers?

If you don't want to fiddle with the JavaCC source of QueryParser.jj, you could work with a regular expression that works in front of the actual query parser.
RE: Disabling modifiers?
I think it is a problem with the indexing. I've found another example...

    WS-CA-PP00-PROCESS-YYMM

I've looked at the index, and it has been tokenized into 3 words...

    WS
    CA-PP00-PROCESS
    YYMM

Looks as though I might have to use a custom tokenizer as well as an analyzer then, but any ideas as to why the standard tokenizer would have split the variable up like this (i.e. why didn't it split the middle bit, only the word off either end)? The only thing I can think of is that there are several other variables in the source beginning with WS- or ending with -YYMM, so could the tokenizer have seen this and be doing something clever with them?

Thanks,
Iain
Re: Disabling modifiers?
On Tuesday, December 16, 2003, at 05:46 AM, Iain Young wrote:
>> Treating them as two separate words when quoted is indicative of your analyzer not being sufficient for your domain. What Analyzer are you using? Do you have knowledge of what it is tokenizing text into?
>
> I have created a custom analyzer (CobolAnalyzer) which contains some custom stop words for the language, but it's using the StandardTokenizer and StandardFilter. I'll have a look and see if I can see what it's actually tokenizing the text into...

Look at my article at java.net and try out the AnalyzerDemo code using some sample text and your custom analyzer:

http://today.java.net/pub/a/today/2003/07/30/LuceneIntro.html

One of the things I plan to do with an enhanced Lucene demo to ship with Lucene's binary distributions is to integrate this type of "analyzing the analyzer" feature. It is the root of a lot of questions about Lucene. You can really only search for what you index, and you only index what the Analyzer creates, so understanding it is key to a lot.

And yes, if you are using StandardTokenizer, you are probably not tokenizing COBOL quite like you expect. Is there a COBOL parser you could tap into that could give you the tokens you want?

Erik
RE: Disabling modifiers?
<grin> Yes, we have got one or two parsers floating around somewhere or other ;)

Unfortunately, I'm unlikely to be able to tap into these before the next version of the product I'm working on (can't say too much because of the NDA etc.), and so for now I'm having to make do with a basic text search. I'll give the whitespace analyzer a try and see if I get any better results.

Thanks,
Iain

-----Original Message-----
From: Erik Hatcher [mailto:[EMAIL PROTECTED]]
Sent: 16 December 2003 12:31
To: Lucene Users List
Subject: Re: Disabling modifiers?

On Tuesday, December 16, 2003, at 07:28 AM, Erik Hatcher wrote:
> And yes, if you are using StandardTokenizer, you are probably not tokenizing COBOL quite like you expect. Is there a COBOL parser you could tap into that could give you the tokens you want?

Ummm. nevermind that last question... I just realized where you work! :)

So, my recommendation would be to tap into some parser for the COBOL language that you have handy and have it feed your Analyzer appropriately. Or, use something very very simple like the WhitespaceAnalyzer as a first try.

Erik
Re: Disabling modifiers?
One of the token patterns defined by the StandardTokenizer.jj is this:

    <NUM: (<ALPHANUM> <P> <HAS_DIGIT>
         | <HAS_DIGIT> <P> <ALPHANUM>
         | <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
         | <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
         | <ALPHANUM> <P> <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
         | <HAS_DIGIT> <P> <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
          )
    >

So basically, if you have some sequences of characters separated by a - character, sequences that contain a digit will be combined with the sequences adjacent to them to form a single token. That explains why the WS and YYMM sequences got separated out: neither contains a digit, and neither do their neighbours CA and PROCESS, whereas PP00 does and so pulls CA and PROCESS into one token. You can alter this behavior with some simple changes to StandardTokenizer.jj.

----- Original Message -----
From: Iain Young <[EMAIL PROTECTED]>
To: 'Lucene Users List' <[EMAIL PROTECTED]>
Sent: Tuesday, December 16, 2003 7:46 AM
Subject: RE: Disabling modifiers?

> I think it is a problem with the indexing. I've found another example... WS-CA-PP00-PROCESS-YYMM. I've looked at the index, and it has been tokenized into 3 words... WS, CA-PP00-PROCESS, YYMM.
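The split reported for WS-CA-PP00-PROCESS-YYMM can be reproduced outside Lucene with a small simulation. The sketch below is not the generated StandardTokenizer. It assumes ALPHANUM is a run of ASCII letters and digits, HAS_DIGIT is such a run containing at least one digit, and P is one of - _ / . , (none of those definitions is quoted in the thread). It ignores every other token type and imitates a longest-match choice by trying each alternative at the current position. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-alone sketch of the NUM rule; not Lucene code. */
final class NumRuleSketch {
    // Assumed sub-token definitions (not quoted in the thread).
    private static final String A = "[A-Za-z0-9]+";                  // ALPHANUM
    private static final String H = "[A-Za-z0-9]*[0-9][A-Za-z0-9]*"; // HAS_DIGIT
    private static final String P = "[-_/.,]";                       // P

    // Plain ALPHANUM, then the six NUM alternatives in the order quoted above.
    private static final List<Pattern> RULES = Arrays.asList(
        Pattern.compile(A),
        Pattern.compile(A + P + H),
        Pattern.compile(H + P + A),
        Pattern.compile(A + "(?:" + P + H + P + A + ")+"),
        Pattern.compile(H + "(?:" + P + A + P + H + ")+"),
        Pattern.compile(A + P + H + "(?:" + P + A + P + H + ")+"),
        Pattern.compile(H + P + A + "(?:" + P + H + P + A + ")+"));

    /** Emits the longest match of any rule at each position; skips other characters. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int best = 0;
            for (Pattern rule : RULES) {
                Matcher m = rule.matcher(text).region(pos, text.length());
                if (m.lookingAt()) {
                    best = Math.max(best, m.end() - pos);
                }
            }
            if (best == 0) {
                pos++; // separator or unsupported character
            } else {
                tokens.add(text.substring(pos, pos + best));
                pos += best;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("WS-CA-PP00-PROCESS-YYMM")); // [WS, CA-PP00-PROCESS, YYMM]
        System.out.println(tokenize("DISP-NAME"));               // [DISP, NAME]
    }
}
```

Under these assumptions a hyphenated name only stays in one piece where every second segment contains a digit, which is why DISP-NAME falls apart while CA-PP00-PROCESS does not.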
RE: Disabling modifiers?
The WhitespaceTokenizer fixed the problem, so that'll do as a stop gap until I can figure out how to write our own COBOL tokenizer.

Thanks for the help,
Iain
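As a rough illustration of what a COBOL-aware tokenizer might do differently from a whitespace split, here is a minimal sketch in plain Java. It does not use any Lucene classes. It assumes a COBOL word is a run of letters and digits with single hyphens inside it and no hyphen at either end, and the class, method names and sample line are invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical comparison of a whitespace split and a COBOL-word regex. */
final class CobolWordSketch {
    // Assumption: letters/digits joined by single hyphens, no leading or trailing hyphen.
    private static final Pattern COBOL_WORD =
        Pattern.compile("[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*");

    /** Splits on runs of whitespace only, so punctuation stays attached. */
    static List<String> whitespaceTokens(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
    }

    /** Extracts hyphenated words and drops surrounding punctuation. */
    static List<String> cobolWords(String line) {
        List<String> words = new ArrayList<>();
        Matcher m = COBOL_WORD.matcher(line);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        String line = "MOVE WS-DATE-1 TO DISP-NAME.";
        System.out.println(whitespaceTokens(line)); // [MOVE, WS-DATE-1, TO, DISP-NAME.]
        System.out.println(cobolWords(line));       // [MOVE, WS-DATE-1, TO, DISP-NAME]
    }
}
```

The whitespace split keeps the statement's closing period attached to DISP-NAME, which is one reason it may only be a stop gap. The regex version keeps the name intact and drops the period, though a real tokenizer would also have to deal with literals, comments and column rules.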
Disabling modifiers?
A quick question. Is there any way to disable the - and + modifiers in the QueryParser?

I'm trying to use Lucene to provide indexing of COBOL source code, and allow me to highlight matches when the code is displayed. In COBOL you can have variable names such as DISP-NAME and WS-DATE-1, for example. Unfortunately the query parser interprets the - signs as modifiers and so the query does not do what is required.

I've had a bit of success by putting quotes around the offending names (as suggested on this list), but the results are still less than satisfactory: it removes the NOT from the query, but still treats DISP and NAME as two separate words rather than one word, and so the results are not quite correct.

Any ideas, or am I going to have to try and write my own query parser?

Thanks,
Iain
Re: Disabling modifiers?
On Monday, December 15, 2003, at 12:12 PM, Iain Young wrote:
> A quick question. Is there any way to disable the - and + modifiers in the QueryParser?

Not currently.

> I've had a bit of success by putting quotes around the offending names, (as suggested on this list), but the results are still less than satisfactory, (it removes the NOT from the query, but still treats DISP and NAME as two separate words rather than one word and so the results are not quite correct).

Treating them as two separate words when quoted is indicative of your analyzer not being sufficient for your domain. What Analyzer are you using? Do you have knowledge of what it is tokenizing text into?

> Any ideas, or am I going to have to try and write my own query parser?

This is an open issue in Lucene. You and the community would be better served if you were able to fix the existing QueryParser and submit the fix back to us. Is it possible someone has already done this and it is pending in Bugzilla? (I haven't checked; searching Bugzilla with Safari doesn't work *sigh*, so it is a pain for me to do.)

Erik
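Since the modifiers cannot be switched off, one possible workaround is to backslash-escape - and + wherever they sit between two alphanumeric characters, before the string reaches the QueryParser. This is only a sketch: the helper name is invented, and whether the QueryParser version in use honours backslash escapes is an assumption that should be checked first. Escaping only gets the name past the parser; the analyzer still decides whether DISP-NAME survives as one token.

```java
import java.util.regex.Pattern;

/** Hypothetical pre-processing helper; not part of Lucene. */
final class QueryEscapeSketch {
    // A '-' or '+' with an ASCII letter or digit directly on both sides.
    private static final Pattern INNER_SIGN =
        Pattern.compile("(?<=[A-Za-z0-9])([-+])(?=[A-Za-z0-9])");

    /** Prefixes inner '-' and '+' with a backslash; leading modifiers are left alone. */
    static String escapeInnerSigns(String query) {
        // "\\\\$1" is the replacement syntax for: one literal backslash, then group 1.
        return INNER_SIGN.matcher(query).replaceAll("\\\\$1");
    }

    public static void main(String[] args) {
        System.out.println(escapeInnerSigns("DISP-NAME"));            // DISP\-NAME
        System.out.println(escapeInnerSigns("+MOVE -WS-DATE-1"));     // +MOVE -WS\-DATE\-1
    }
}
```

A sign at the start of a term is deliberately left untouched, so +MOVE and -WS-DATE-1 keep their required and prohibited meaning while the hyphens inside the name are escaped.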
RE: Disabling modifiers?
If you don't want to fiddle with the JavaCC source of QueryParser.jj, you could work with a regular expression that works in front of the actual query parser. I just did something similar because I input Lucene's query strings into a latent semantic analysis algorithm and remove words with + and ? wildcards, boosting modifiers as well as NOT and - clauses and groupings. Such as:

    /**
     * exclude words that have these modifiers
     */
    public final String excludeWildcards = "\\w+\\+|\\w+\\?";

    /**
     * remove these operators
     */
    public final String removeOperators = "AND|OR|UND|ODER|&&|\\|\\|";

    /**
     * remove these modifiers
     */
    public final String removeModifiers = "~[0-9\\.]*|~|\\^[0-9\\.]*|\\*";

    /**
     * exclude phrases that have these modifiers
     */
    public final String excludeNot = "(NOT |\\-) *\\w+|(NOT|\\-) *\\([^\\)]+\\)|(NOT |\\-) *\\\"[^\\\"]+\\\"";

    /**
     * remove any groupings
     */
    public final String removeGrouping = "[\\(\\)]";

You then create Pattern objects from the strings using Pattern.compile() and can use and re-use the compiled patterns.

    excludeWildcardsPattern = Pattern.compile(excludeWildcards);
    lsaQ = excludeWildcardsPattern.matcher(q).replaceAll("");

This works fine for me. However, this 20-minute approach does not recognise nested parentheses with NOT or -, i.e., the term NOT ((a OR b) AND (c OR d)) will result in the removal of NOT ((a OR b) and c d will still be in the output query.

Best regards,

Gregor

-----Original Message-----
From: Iain Young [mailto:[EMAIL PROTECTED]]
Sent: Monday, December 15, 2003 6:13 PM
To: Lucene mailing list (E-mail)
Subject: Disabling modifiers?

A quick question. Is there any way to disable the - and + modifiers in the QueryParser?
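Put together, the pre-processing described in Gregor's message might look like the sketch below. The expressions are taken from the message, with two changes that are assumptions of this example: the keyword operators are wrapped in \b word boundaries so that a word such as WORD is not damaged, and the order in which the patterns are applied is a guess. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical wrapper around the regular expressions from the message. */
final class QueryStripSketch {
    // Order of application is an assumption; NOT and - clauses go first.
    private static final List<Pattern> STEPS = Arrays.asList(
        // exclude NOT / - clauses: single words, one-level groups, quoted phrases
        Pattern.compile("(NOT |\\-) *\\w+|(NOT|\\-) *\\([^\\)]+\\)|(NOT |\\-) *\\\"[^\\\"]+\\\""),
        // exclude words carrying these modifiers
        Pattern.compile("\\w+\\+|\\w+\\?"),
        // remove fuzzy, boost and * modifiers
        Pattern.compile("~[0-9\\.]*|~|\\^[0-9\\.]*|\\*"),
        // remove operators (word boundaries added for this sketch)
        Pattern.compile("\\b(?:AND|OR|UND|ODER)\\b|&&|\\|\\|"),
        // remove any groupings
        Pattern.compile("[\\(\\)]"));

    /** Applies every pattern in turn, then collapses the leftover whitespace. */
    static String strip(String query) {
        String result = query;
        for (Pattern step : STEPS) {
            result = step.matcher(result).replaceAll("");
        }
        return result.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(strip("lucene AND index^2 NOT java"));  // lucene index
        System.out.println(strip("NOT ((a OR b) AND (c OR d))"));  // c d
    }
}
```

The second call shows the nested-parentheses limitation mentioned in the message: c and d survive even though they sit inside the negated group. Note also that the - clause pattern removes -NAME from DISP-NAME, so for COBOL identifiers the patterns would need adjusting before use.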