RE: Disabling modifiers?

2003-12-16 Thread Iain Young
> Treating them as two separate words when quoted is indicative of your
> analyzer not being sufficient for your domain.  What Analyzer are you
> using?  Do you have knowledge of what it is tokenizing text into?

I have created a custom analyzer (CobolAnalyzer) which contains some custom
stop words for the language, but it's using the StandardTokenizer and
StandardFilters. I'll have a look and see if I can see what it's actually
tokenizing the text into...

> Any ideas, or am I going to have to try and write my own query parser?

Well, if I manage to get something working, I'll let you know :-)

Thanks,
Iain

*
*  Micro Focus Developer Forum 2004 *
*  3 days that will make a difference   *
*  www.microfocus.com/devforum  *
*

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Disabling modifiers?

2003-12-16 Thread Iain Young
Thanks Gregor, I'll give it a try...

Iain

-Original Message-
From: Gregor Heinrich [mailto:[EMAIL PROTECTED]
Sent: 15 December 2003 18:32
To: 'Lucene Users List'
Subject: RE: Disabling modifiers?


If you don't want to fiddle with the JavaCC source of QueryParser.jj, you
could work with a regular expression that works in front of the actual query
parser. I just did something similar because I input Lucene's query strings
into a latent semantic analysis algorithm and remove words with + and ?
wildcards, boosting modifiers as well as NOT and - clauses and groupings.
Such as:

/**
 *  exclude words that have these modifiers
 */
public final String excludeWildcards = "\\w+\\+|\\w+\\?";
/**
 *  remove these operators
 */
public final String removeOperators = "AND|OR|UND|ODER|&&|\\|\\|";
/**
 *  remove these modifiers
 */
public final String removeModifiers = "~[0-9\\.]*|~|\\^[0-9\\.]*|\\*";
/**
 *  exclude phrases that have these modifiers
 */
public final String excludeNot = "(NOT |\\-) *\\w+|(NOT|\\-) *\\([^\\)]+\\)"
    + "|(NOT |\\-) *\\\"[^\\\"]+\\\"";

/**
 * remove any groupings
 */
public final String removeGrouping = "[\\(\\)]";

You then create Pattern objects from the strings using Pattern.compile() and
can use and re-use the compiled patterns.

excludeWildcardsPattern = Pattern.compile(excludeWildcards);

lsaQ = excludeWildcardsPattern.matcher(q).replaceAll("");

This works fine for me. However, this 20-minute approach does not recognise
nested parentheses with NOT or -, i.e., the term
"NOT ((a OR b) AND (c OR d))" will result in the removal of "NOT ((a OR b"
and "c d" will still be in the output query.
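
The pre-processing idea described above can be sketched as one self-contained class. This is only an illustration under assumptions: the class name QueryCleaner, the method name clean, and the order in which the patterns are applied are not from the original post, and word boundaries (\b) have been added around the word operators so that an "OR" inside a longer word is left alone.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: strips query syntax with regular expressions before further processing. */
final class QueryCleaner {
    // NOT / - clauses: a quoted phrase, a (non-nested) parenthesised group, or a single word.
    private static final Pattern EXCLUDE_NOT = Pattern.compile(
        "(NOT |-) *\"[^\"]+\"|(NOT |-) *\\([^\\)]+\\)|(NOT |-) *\\w+");
    // Words directly followed by + or ? are dropped entirely.
    private static final Pattern EXCLUDE_WILDCARDS = Pattern.compile("\\w+\\+|\\w+\\?");
    // Fuzzy (~), boost (^) and * markers are removed, the word itself is kept.
    private static final Pattern REMOVE_MODIFIERS = Pattern.compile("~[0-9.]*|\\^[0-9.]*|\\*");
    // Boolean operators (English and German keywords, as in the original post).
    private static final Pattern REMOVE_OPERATORS =
        Pattern.compile("\\b(?:AND|OR|UND|ODER)\\b|&&|\\|\\|");
    // Any remaining parentheses.
    private static final Pattern REMOVE_GROUPING = Pattern.compile("[()]");

    /** Applies the patterns in an assumed order and normalises whitespace. */
    static String clean(String query) {
        String q = EXCLUDE_NOT.matcher(query).replaceAll(" ");
        q = EXCLUDE_WILDCARDS.matcher(q).replaceAll(" ");
        q = REMOVE_MODIFIERS.matcher(q).replaceAll("");
        q = REMOVE_OPERATORS.matcher(q).replaceAll(" ");
        q = REMOVE_GROUPING.matcher(q).replaceAll(" ");
        return q.trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        System.out.println(clean("NOT foo AND bar^2 (baz OR qux~0.5)"));
        // The nested NOT case shows the limitation described above.
        System.out.println(clean("NOT ((a OR b) AND (c OR d))"));
    }
}
```

Note that the "-" alternative also fires inside hyphenated words, so a name such as DISP-NAME would lose its second half; this sketch is therefore not suitable as-is for the COBOL case discussed in this thread.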

Best regards,

Gregor

This e-mail has been scanned for viruses by MCI's Internet Managed Scanning
Services - powered by MessageLabs. For further information visit
http://www.mci.com



RE: Disabling modifiers?

2003-12-16 Thread Iain Young
I think it is a problem with the indexing. I've found another example...

WS-CA-PP00-PROCESS-YYMM

I've looked at the index, and it has been tokenized into 3 words...

WS
CA-PP00-PROCESS
YYMM

Looks as though I might have to use a custom tokenizer as well as an
analyzer then, but any ideas as to why the standard tokenizer would have
split the variable up like this (i.e. why didn't it split the middle bit,
only the word off either end)? The only thing I can think of is that there
are several other variables in the source beginning with WS- or ending with
-YYMM, so could the tokenizer have seen this and be doing something clever
with them?
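
As a rough illustration of what a COBOL-aware tokenizer would need to produce, the sketch below keeps hyphen-joined runs of letters and digits together as one token. It is plain Java and does not use any Lucene classes; the class name CobolWordTokenizer and the word pattern are assumptions for illustration, not a full definition of a COBOL user-defined word. One possible explanation for the observed split, offered as an assumption rather than a confirmed diagnosis, is that the standard tokenizer's grammar keeps hyphen-joined segments together only when digits are involved (as in product numbers), which would fit CA-PP00-PROCESS surviving while WS and YYMM are split off.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical tokenizer sketch: treats hyphenated names as single tokens. */
final class CobolWordTokenizer {
    // A run of letters/digits, optionally continued by further runs joined with single hyphens.
    private static final Pattern WORD = Pattern.compile("[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*");

    /** Returns the tokens found in one line of source text, in order. */
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        Matcher m = WORD.matcher(line);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("MOVE WS-CA-PP00-PROCESS-YYMM TO DISP-NAME."));
    }
}
```

In a real index the same splitting rule would have to be applied at both index time and query time, otherwise the terms will not line up.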

Thanks,
Iain


RE: Disabling modifiers?

2003-12-16 Thread Iain Young
<grin> Yes, we have got one or two parsers floating around somewhere or other
;)

Unfortunately, I'm unlikely to be able to tap into these before the next
version of the product I'm working on (can't say too much because of the NDA
etc.), and so for now I'm having to make do with a basic text search. I'll
give the whitespace analyzer a try and see if I get any better results.

Thanks,
Iain

-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: 16 December 2003 12:31
To: Lucene Users List
Subject: Re: Disabling modifiers?


On Tuesday, December 16, 2003, at 07:28  AM, Erik Hatcher wrote:
> And yes, if you are using StandardTokenizer, you are probably not
> tokenizing COBOL quite like you expect.  Is there a COBOL parser you
> could tap into that could give you the tokens you want?

Ummm. nevermind that last question... I just realized where you 
work!  :)

So, my recommendation would be to tap into some parser for the COBOL 
language that you have handy and have it feed your Analyzer 
appropriately.  Or, use something very very simple like the 
WhitespaceAnalyzer as a first try.

Erik



RE: Disabling modifiers?

2003-12-16 Thread Iain Young
The WhitespaceTokenizer fixed the problem, so that'll do as a stop gap until
I can figure out how to write our own COBOL tokenizer.
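
A small sketch of why splitting on whitespace alone keeps the hyphenated names intact, and why it is only a stop gap. It imitates whitespace-only splitting with String.split rather than calling any Lucene class; the class name WhitespaceSplitDemo is made up for illustration.

```java
/** Hypothetical demo: whitespace-only splitting of a line of COBOL. */
final class WhitespaceSplitDemo {
    /** Splits on runs of whitespace; nothing else is treated as a separator. */
    static String[] split(String line) {
        String trimmed = line.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    public static void main(String[] args) {
        for (String token : split("MOVE WS-DATE-1 TO DISP-NAME.")) {
            System.out.println(token);
        }
    }
}
```

WS-DATE-1 comes out as one token, but the last token is "DISP-NAME." with the full stop still attached, and case is left untouched, so a search for DISP-NAME would presumably miss that occurrence unless punctuation is stripped as well.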

Thanks for the help,
Iain



Disabling modifiers?

2003-12-15 Thread Iain Young
A quick question. Is there any way to disable the - and + modifiers in the
QueryParser? I'm trying to use Lucene to provide indexing of COBOL source
code, and allow me to highlight matches when the code is displayed. In COBOL
you can have variable names such as DISP-NAME and WS-DATE-1 for example.
Unfortunately the query parser interprets the - signs as modifiers and so
the query does not do what is required. 

I've had a bit of success by putting quotes around the offending names, (as
suggested on this list), but the results are still less than satisfactory,
(it removes the NOT from the query, but still treats DISP and NAME as two
separate words rather than one word and so the results are not quite
correct).
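
One alternative to quoting is to backslash-escape the syntax characters before the string reaches the query parser. The sketch below is plain Java with a hypothetical class name (QueryEscaper); the list of special characters is an assumption based on Lucene's documented query syntax, and whether the QueryParser version in use honours backslash escapes needs to be checked. Escaping only stops the "-" being read as a modifier; the analyzer may still split the term.

```java
/** Hypothetical helper: backslash-escapes characters that a query parser may treat as syntax. */
final class QueryEscaper {
    // Assumed set of single-character syntax symbols; && and || are not handled here.
    private static final String SPECIAL = "+-!(){}[]^\"~*?:\\";

    /** Returns the term with a backslash inserted before each special character. */
    static String escape(String term) {
        StringBuilder sb = new StringBuilder(term.length() + 8);
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("DISP-NAME"));
        System.out.println(escape("WS-DATE-1"));
    }
}
```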

Any ideas, or am I going to have to try and write my own query parser?

Thanks,
Iain



RE: Help with Searching indexes from a web app (Lucene 1.3 rc2)

2003-12-02 Thread Iain Young
Well, I've fixed the problem. 

I've upgraded tomcat to version 4.1.29 and it's all working again now,
(using exactly the same tomcat configuration etc as before). Seems as though
there is a compatibility problem of some sort with tomcat 4.0.3.

I am using the basic tomcat security on the webapp, but I wouldn't have
expected this to cause the problem, especially seeing as when I upgraded
tomcat to 4.1.29 the problem went away, (using the same security model).
Also, if it was a security problem, I'd have expected it to have trouble when
creating the indexes, not just when searching them.

At least it's working now

Thanks,
Iain

-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: 01 December 2003 23:31
To: Lucene Users List
Subject: Re: Help with Searching indexes from a web app (Lucene 1.3 rc2)


Also, reindex with the new API as well.  There are likely  
incompatibilities in the index format.





RE: Searching for -

2003-12-02 Thread Iain Young
Thanks Dmitri :-)

-Original Message-
From: Dmitri Mamrukov [mailto:[EMAIL PROTECTED]
Sent: 02 December 2003 17:18
To: Lucene Users List
Subject: Re: Searching for -


Hi Iain,

There was a discussion about "Dash Confusion" in QueryParser (search for
"t-shirt" - with quote symbols! - or "Dash Confusion"). The suggestion was to
escape such words by putting quote symbols around them. For instance, ask
your application for "DISP-NAME" instead of DISP-NAME.
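
The quoting suggestion can be applied automatically before the query string is parsed. A minimal sketch in plain Java, assuming that any unquoted run of word characters joined by hyphens should be treated as a single name; the class name HyphenQuoter and the pattern are illustrative only.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: wraps hyphenated names in double quotes. */
final class HyphenQuoter {
    // Word characters joined by one or more hyphens, not already inside quotes
    // and not starting with a leading "-" (which is left alone as a NOT modifier).
    private static final Pattern HYPHENATED =
        Pattern.compile("(?<![\"\\w-])\\w+(?:-\\w+)+(?![\"\\w-])");

    /** Returns the query with each hyphenated name surrounded by double quotes. */
    static String quoteHyphenated(String query) {
        return HYPHENATED.matcher(query).replaceAll("\"$0\"");
    }

    public static void main(String[] args) {
        System.out.println(quoteHyphenated("DISP-NAME AND total"));
    }
}
```

A leading "-" as in "foo -bar" is not touched, so intentional exclusions keep working. Whether the quoted name then matches as one word still depends on how the analyzer tokenizes it.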

Dmitri

- Original Message - 
From: Iain Young [EMAIL PROTECTED]
To: Lucene mailing list (E-mail) [EMAIL PROTECTED]
Sent: Tuesday, December 02, 2003 12:01 PM
Subject: Searching for -


> Hi folks, another newbie question for you.
>
> I'm using Lucene to index huge chunks of source code (COBOL, JCL, C, Java,
> text documents, etc.). In some of these languages (such as COBOL) it is valid
> to have a variable name of DISP-NAME, for example. The problem I have is that
> when you enter this search string into the Lucene query engine, it reads the
> - character as a NOT modifier rather than as part of the word, and so I'm
> getting incorrect results (it basically does a search for DISP NOT NAME).
>
> Anyone any ideas as to how to get around this (can you 'escape' the modifier
> characters so that Lucene doesn't interpret them as such, for example)?
>
> Thanks,
> Iain Young
> http://www.microfocus.com




Help with Searching indexes from a web app (Lucene 1.3 rc2)

2003-12-01 Thread Iain Young
 Hi folks.
 
 I'm new to Lucene so this may be an obvious question, but I am having
 problems with Lucene 1.3-rc2. I've got a bit of code which looks something
 like this:
 
 public static void getSearchResults(String searchString, String indexDir)
 {
     try
     {
         Searcher searcher = new IndexSearcher(indexDir);
         .
         etc...
         .
     }
     catch (Exception ex)
     {
     }
 }
 
 I'm calling it from a web application (servlet) running in Tomcat in
 conjunction with Struts and Velocity. If I use the Lucene 1.2 binary
 release, it all works fine and I get the search results ok. However, when
 I replace the 1.2 jar file with the 1.3-rc2 jar file (leaving all of my
 code exactly the same), it stops working, and I get a path not found
 exception being thrown.
 
 I've narrowed it down to the IndexReader.open(final Directory directory)
 method. Even if I pass a valid Directory object into this (created by
 FSDirectory), it just seems to throw the exception, (even though I know
 the directory object is not null etc). The bizarre thing is that this
 problem only seems to occur when I run it from the web application. If I
 invoke the same code from the command line, it works ok, (even though I'm
 using the same string for the index dir).
 
 Anyone got any ideas? (I want to use 1.3 because I want to exploit some of
 the newer features). Does running from within a web application do
 something strange with the paths, even though the strings I'm using are
 fully qualified?
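
 One way to narrow this kind of problem down is to log what the JVM inside the container actually sees for the index path, and to log the exception rather than swallowing it in an empty catch block. The sketch below uses only java.io.File; the class name IndexDirCheck and the output format are made up for illustration, and it does not call Lucene at all.

```java
import java.io.File;

/** Hypothetical diagnostic: reports how the running JVM resolves an index directory. */
final class IndexDirCheck {
    /** Builds a one-line description of the path as seen by this process. */
    static String describe(String indexDir) {
        File dir = new File(indexDir);
        return "path=" + dir.getAbsolutePath()
            + " exists=" + dir.exists()
            + " directory=" + dir.isDirectory()
            + " readable=" + dir.canRead()
            + " user.dir=" + System.getProperty("user.dir");
    }

    public static void main(String[] args) {
        String indexDir = args.length > 0 ? args[0] : ".";
        System.out.println(describe(indexDir));
    }
}
```

 Calling describe(indexDir) from both the servlet and the command line and comparing the two lines would show whether the working directory or the file permissions differ between the two environments.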
 
 Thanks for your help,
 
 Iain Young
 http://www.microfocus.com
 


RE: Help with Searching indexes from a web app (Lucene 1.3 rc2)

2003-12-01 Thread Iain Young
Note that I've just tried the example webapp supplied with Lucene, and I
appear to be having exactly the same problem with that. The 1.2 version
works ok, but the 1.3 version is displaying a path not found error.

Are there any known incompatibilities with certain versions of Tomcat? (I'm
currently using version 4.0.3.)

Thanks,
Iain
