RE: Lucene 1.4 RC 3 issue with temp directory

2004-05-17 Thread Eric Isakson
Your catalina.bat script is guessing your CATALINA_HOME environment variable, since you 
don't have one set, and is setting java.io.tmpdir based on that guess. You could work 
around this by setting a CATALINA_HOME environment variable or by setting the system 
property org.apache.lucene.lockdir. That doesn't solve the problem for Lucene locks 
when java.io.tmpdir is set to a relative path that does not exist, though.
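
For illustration, a hedged example (the lock directory path here is made up); the 
property can be set in code before any index is opened, or passed as a -D option to 
the JVM Tomcat runs in (e.g. via JAVA_OPTS in catalina.bat):

// point Lucene's lock files at a directory that is known to exist
System.setProperty("org.apache.lucene.lockdir", "C:\\ENG\\index\\locks");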

Eric

-Original Message-
From: Grant Ingersoll [mailto:[EMAIL PROTECTED] 
Sent: Monday, May 17, 2004 2:15 PM
To: [EMAIL PROTECTED]
Subject: Lucene 1.4 RC 3 issue with temp directory


Hi All,

I just upgraded to 1.4 RC 3 and am now unable to open my index.

I am getting: 
java.io.IOException: The system cannot find the path specified
at java.io.WinNTFileSystem.createFileExclusively(Native Method)
at java.io.File.createNewFile(File.java:828)
at org.apache.lucene.store.FSDirectory$1.obtain(FSDirectory.java:297)
at org.apache.lucene.store.Lock.obtain(Lock.java:53)
at org.apache.lucene.store.Lock$With.run(Lock.java:108)
at org.apache.lucene.index.IndexReader.open(IndexReader.java:111)
at org.apache.lucene.index.IndexReader.open(IndexReader.java:95)
at org.apache.lucene.search.IndexSearcher.init(IndexSearcher.java:38)


I _have_ reindexed using the new lucene jar.  I am positive the path is correct as I 
can open an index in the same directory with the old Lucene with no problems.  I 
notice that the problem only occurs when I am deployed inside of Tomcat.  If I run 
searches on the command line or through JUnit everything functions correctly.  

When I print out the lockDir location that the lock is being obtained in above, it looks 
like: C:\ENG\index\LDC\trec-ar-dar\..\temp, which is the directory my index resides in, 
except ..\temp does not exist.  When I create the directory, it works.  I suppose I 
could create the temp directory for every index, but I didn't know that was a 
requirement.  I do notice that Tomcat has a temp directory at the top, so it is 
probably setting the java.io.tmpdir system property to ..\temp, 
which is being picked up by Lucene?  The question is, what changed in RC 3 that would 
cause this to be used when it wasn't before? 

On a side note, would it be useful to create the lock directory if it doesn't exist?  
If the developers think so, I can submit the patch for it.

Thanks,
Grant


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: IndexSearcher on JAR resources?

2004-05-13 Thread Eric Isakson
This isn't exactly what you were asking for, and I know it is a somewhat ugly way to 
implement this (it violates some OO rules by having knowledge of RAMDirectory's internal 
implementation), but I thought it might be of use to some folks and/or might provide a 
starting point for someone else to try to tackle this.

It provides a mechanism for getting a RAMDirectory for an index stored on your 
classpath, provided that you know the names of the files that comprise the index, using 
ClassLoader.getResource.

The test class creates an index, puts it in a jar, adds the jar to a class loader, 
then reads the index from the jar via the class loader into a RAMDirectory.
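
A rough sketch of the approach (this is an outline, not the attached code; the prefix 
and file list are assumptions you must supply, and error handling is omitted):

import java.io.InputStream;
import org.apache.lucene.store.OutputStream;
import org.apache.lucene.store.RAMDirectory;

public class ClasspathIndexLoader {
    /** Copies the named index files from the classpath into a RAMDirectory. */
    public static RAMDirectory load(ClassLoader cl, String prefix, String[] files)
            throws java.io.IOException {
        RAMDirectory dir = new RAMDirectory();
        byte[] buf = new byte[1024];
        for (int i = 0; i < files.length; i++) {
            // assumes the resource exists; a real version should null-check
            InputStream in = cl.getResourceAsStream(prefix + "/" + files[i]);
            OutputStream out = dir.createFile(files[i]); // Lucene's own OutputStream
            int n;
            while ((n = in.read(buf)) != -1) {
                out.writeBytes(buf, n);
            }
            out.close();
            in.close();
        }
        return dir;
    }
}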

Eric

-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED] 
Sent: Friday, May 07, 2004 9:21 AM
To: Lucene Users List
Subject: Re: IndexSearcher on JAR resources?


On May 7, 2004, at 6:14 AM, Edin Pezerovic wrote:
 Hi,
 I found following entry within the mail-archives:

 http://www.mail-archive.com/[EMAIL PROTECTED]/msg02129.html


 Is there now (two years later) a possibility to have the index within a 
 jar-file?

Someone posted something like this at one point, but ironically I  
cannot _find_ it.  I definitely would be interested in having something  
like this handy.

If anyone has pointers to the implementations posted, please let us  
know.

Thanks,
Erik


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

RE: languages supported by lucene 1.2.1 in eclipse help system

2004-04-27 Thread Eric Isakson
I'm assuming what you have is an eclipse plugin that is making use of the eclipse help 
system. If what you are doing is relying on the lucene eclipse plugin, you may want to 
look at the help system anyway since it will give you an example of an eclipse plugin 
that is using the lucene plugin.

The eclipse help system uses lucene, but they have their own Analyzer class that uses 
BreakIterator to identify tokens for languages other than English and German. The 
lucene eclipse plugin just exports the lucene jar and the html parser so that any 
plugin that depends on the lucene plugin (like the help system) will have those jars 
in the classpath of their plugin.

For English they use the PorterStemFilter with a StopAnalyzer and a stopword list. For 
German, they use the GermanAnalyzer supplied by the lucene jar.

In the latest CVS at :pserver:[EMAIL PROTECTED]:/home/eclipse

see the project in org.eclipse.help.base/src/org/eclipse/help/internal/search; 
in older eclipse versions, see the R2_1_maintenance branch of 
org.eclipse.help/src/org/eclipse/help/internal/search.

The class DefaultAnalyzer is the analyzer implementation for languages other than 
English and German, and WordTokenStream is where they use BreakIterator to break the 
content from the reader into individual tokens.
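
To illustrate the idea, a standalone sketch (mine, not the eclipse code itself):

import java.text.BreakIterator;
import java.util.Locale;

public class BreakIteratorDemo {
    public static void main(String[] args) {
        String text = "L'analyseur découpe le texte en mots."; // sample input
        BreakIterator words = BreakIterator.getWordInstance(Locale.FRENCH);
        words.setText(text);
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE;
                start = end, end = words.next()) {
            String w = text.substring(start, end).trim();
            // keep only spans that start with a letter or digit
            if (w.length() > 0 && Character.isLetterOrDigit(w.charAt(0))) {
                System.out.println(w); // a candidate token
            }
        }
    }
}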

The default Eclipse help system sets these extensions in the org.eclipse.help.base 
plugin:

<!-- Text Analyzers for search -->
   <extension
         id="org.eclipse.help.base.Analyzer_en"
         point="org.eclipse.help.base.luceneAnalyzer">
      <analyzer
            locale="en"
            class="org.eclipse.help.internal.search.Analyzer_en">
      </analyzer>
   </extension>
   <extension
         id="org.eclipse.help.base.Analyzer_de"
         point="org.eclipse.help.base.luceneAnalyzer">
      <analyzer
            locale="de"
            class="org.apache.lucene.analysis.de.GermanAnalyzer">
      </analyzer>
   </extension>

Look at the extension point schema in 
http://dev.eclipse.org/viewcvs/index.cgi/~checkout~/org.eclipse.help.base/schema/luceneAnalyzer.exsd?rev=HEAD&content-type=text/plain
 for how to declare your own analyzer extensions. Beware though, I read that this 
affects all help searches in that language, not just the ones for your plugin.

Also, since the WordTokenStream is in a package with "internal" in its path, you 
aren't supposed to ever make use of that class from other plugins, so if you wanted 
your own analyzer based on that class and a stop list, you shouldn't use that class 
without talking the eclipse help developers into moving it outside of an internal 
package.

Most of this has been around for a while, so it is probably the same or very similar 
in previous eclipse versions. You may need to poke around at the extension point 
schema in your eclipse plugins directory to verify that the extension point works the 
same way in your version of eclipse. I haven't used it in versions prior to 3.0M8.

Hope this is useful to you,
Eric

-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] 
Sent: Saturday, April 24, 2004 10:18 AM
To: Lucene Users List
Subject: Re: languages supported by lucene 1.2.1 in eclipse help system


That's no myth :)
Core Lucene (even the current version) does not include classes that know how to 
analyze/tokenize text in languages other than English, Russian, and German.  However, 
take a look at the Snowball contributions in Lucene Sandbox, where a few more 
analyzers are available, including those for the CJK group of languages.

Otis


--- Jason Elliott [EMAIL PROTECTED] wrote:
 We have a plugin in our eclipse project named org.apache.lucene_1.2.1.
 It works quite well in that help system.
  
 I've been notified that this particular version of the lucene search 
 analyzer searches well in German and English (GE), but not so well in 
 the rest of the languages on this planet.
  
 I have several questions:
 1. If it does not search very well in French, Italian and Japanese
 (FIJ), what does that really mean to a user conducting searches?
 a. If this is a myth and the searches work the same in EFIG-J, please
 let me know that.
 b. If this is not a myth, are there plugins that enable the search
 to work well in FIJ?
  
 Thanks
 jason
  
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Searches containing a dollar sign $

2004-03-18 Thread Eric Isakson
I think Erik Hatcher commented on a similar problem the other day. When QueryParser 
handles a query ending in *, which it turns into a prefix query, the token the prefix 
query is built from is not analyzed.

StandardAnalyzer would turn abc$def into two tokens, abc and def.

QueryParser would take query 2 and build a PrefixQuery with abc as the prefix, and 
query 3 as a PrefixQuery with abc$ as the prefix.

There are probably a million valid reasons why this is appropriate default behavior 
for QueryParser. One off the top of my head is that with a stemming analyzer, you may 
not get an appropriate stem if you analyzed the prefix. In this case, if this is not 
appropriate behavior for your application, you should probably create a custom query 
parser with different behavior.
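
For example, a hedged sketch (assuming a Lucene version where getPrefixQuery is a 
protected hook, as it is in 1.4; the class name is mine, and lowercasing is just one 
possible normalization):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class NormalizingQueryParser extends QueryParser {
    public NormalizingQueryParser(String field, Analyzer analyzer) {
        super(field, analyzer);
    }

    // normalize the prefix to better match what the analyzer put in the index
    protected Query getPrefixQuery(String field, String termStr)
            throws ParseException {
        return super.getPrefixQuery(field, termStr.toLowerCase());
    }
}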

Eric

Here is the snip of QueryParser.jj that builds the query objects. The only one that is 
analyzed is the field query. The term productions generally break on whitespace and 
special unescaped query operators (see the .jj file for the full details):

     term=<TERM>
     | term=<PREFIXTERM> { prefix=true; }
     | term=<WILDTERM> { wildcard=true; }
     | term=<NUMBER>
   )
   [ <FUZZY> { fuzzy=true; } ]
   [ <CARAT> boost=<NUMBER> [ <FUZZY> { fuzzy=true; } ] ]
   {
     String termImage=discardEscapeChar(term.image);
     if (wildcard) {
       q = getWildcardQuery(field, termImage);
     } else if (prefix) {
       q = getPrefixQuery(field,
         discardEscapeChar(term.image.substring
           (0, term.image.length()-1)));
     } else if (fuzzy) {
       q = getFuzzyQuery(field, termImage);
     } else {
       q = getFieldQuery(field, analyzer, termImage);
     }
   }

-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] 
Sent: Thursday, March 18, 2004 11:44 AM
To: Lucene Users List
Subject: Re: Searches containing a dollar sign $


Are you indexing your documents with the same Analyzer?
Are you using QueryParser?
Are you able to get query 3) to work when using queries directly, without a 
QueryParser?

Otis

--- Reece [EMAIL PROTECTED] wrote:
 Hi,
 
 I have a field that has a dollar sign in it like this:
   abc$def
 
 I perform the following queries using the
 StandardAnalyzer:
 
 1). myField:abc$def -> work
 2). myField:abc*    -> work
 3). myField:abc$*   -> no work
 
 Why doesn't the third query work?  Is there an
 analyzer that will handle all three of these queries?
 
 Thanks,
 Reece
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Japanese Analyzer

2004-01-30 Thread Eric Isakson
I've been using the CJKAnalyzer for a while now and our native Japanese-speaking 
development staff haven't had any complaints with the results they are getting in 
their searches.

Just be sure you get all the character encoding issues straight. One of the gotchas I 
ran into when I first started working with this was improper character encoding 
handling in my web application.

Eric

-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] 
Sent: Thursday, January 29, 2004 1:46 PM
To: Lucene Users List
Subject: Re: Japanese Analyzer


I think that's the only one we've got.
You can browse the Lucene Sandbox contributions directory, it's there.

Otis

--- Weir, Michael [EMAIL PROTECTED] wrote:
 Is the CJKAnalyzer the best to use for Japanese?  If not, which is?
 If so,
 from where can I download it?
 Thanks.
 
 Michael Weir . Transform Research Inc. . 613.238.1363 x.114
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Keyword search with space and wildcard

2003-09-02 Thread Eric Isakson
Not sure about documented examples, but I often find the unit tests (in src/test of 
lucene's CVS) to be very useful as examples; I didn't see any for what you are 
looking for, though.

Basically, query parser builds up a vector of BooleanClause objects, then loops over 
those on a BooleanQuery object calling add(BooleanClause). I agree JavaCC isn't really 
simple to follow, but there is a lot of plain java in there that does the parts you 
are interested in, and if you build the .java file and ignore the token parsing stuff, 
you can look at it in your favorite java IDE.

What you can do is cast the query you get from QueryParser to a BooleanQuery (that is 
the only type of Query that QueryParser will return), then create your WildcardQuery or 
any other queries you need that you didn't get in the query string, and add them as 
clauses to the BooleanQuery using add(Query query, boolean required, boolean 
prohibited).
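
A minimal sketch of that (1.x-era API; the field name and wildcard term are made up):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.WildcardQuery;

public class CombineExample {
    public static Query build(String userQuery) throws ParseException {
        Query parsed = QueryParser.parse(userQuery, "contents",
                new StandardAnalyzer());
        BooleanQuery combined = (BooleanQuery) parsed; // per the note above
        // add the hand-built clause: required=true, prohibited=false
        combined.add(new WildcardQuery(new Term("contents", "wild*card")),
                true, false);
        return combined;
    }
}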

I don't know how Query.combine works (never used it), but the javadoc comment leads me 
to believe it is not what you are looking for, and a bit of poking around in the 
sources gives me the same impression.

Eric 

-Original Message-
From: Brian Campbell [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, September 02, 2003 11:05 AM
To: [EMAIL PROTECTED]
Subject: Re: Keyword search with space and wildcard


Great.  Is there an example anywhere of how I might be able to build such a 
Query?  QueryParser isn't really all that simple since it's built with 
JavaCC.

What might be ideal for me is if I can continue to use the high-level 
interface to build the main query (i.e. use it to parse my query string and 
return me some kind of Query - BooleanQuery, TermQuery, etc.) and then build 
a WildcardQuery by hand and combine the two together.  For example, is it 
as simple as calling Query.combine() to combine the two?  Is there a better 
way?  Is there a documented example like this?  Thanks!

-Brian





This can be done, AFAIK.

This is one thing that many people seem unaware of: you don't HAVE to 
use QueryParser to build queries. In your case it seems like you should 
be able to construct the query you want if you either bypass QueryParser 
or create a dummy analyzer (one that does no tokenization but returns 
all input as one token).




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: doc number as integer

2003-08-27 Thread Eric Isakson
I remember this coming up before... using long causes thread-safety issues...

http://www.javaworld.com/javaworld/jw-09-1997/jw-09-raceconditions.html

I couldn't find anything on sun's java site to reference, but I didn't look too hard.

Eric

-Original Message-
From: Neil [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, August 27, 2003 1:40 PM
To: [EMAIL PROTECTED]
Subject: doc number as integer


It seems that since the index document number value is a positive int, this restricts 
the number of documents in an index to ( 2^31 - 1 ) = 2,147,483,647.

Do I misunderstand?

I mean, that's enough for me, but it seems a kind of surprising restriction, 
considering long could be used instead for unimaginably large numbers of documents.  
Well, I grant I probably can't imagine 2 billion documents either, but google can.

Just curious, sorry to bother anyone.


Neil 




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: javacc problem + path/link problem in html demo

2003-08-01 Thread Eric Isakson
JavaCC 3 is not supported by ant yet...
http://nagoya.apache.org/bugzilla/show_bug.cgi?id=19468
http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]msgId=763762
http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]msgId=774059

Eric

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
Sent: Friday, August 01, 2003 2:31 PM
To: [EMAIL PROTECTED]
Subject: javacc problem + path/link problem in html demo



When I run ant, I get an error and it won't build.

Why am I building?  When I run the web demo, all the links are formed with luceneweb/ 
preceding them (the links are incorrect):

and the links come out as:

http://localhost:8080/luceneweb/examples/foo.jsp

when it should be:

http://localhost:8080/examples/foo.jsp

and I'm using tomcat, btw.

I hunted down the line that gets the path in HTMLDocument (in the demo), and added 
some scaffolding to see what it says the link is; and so I wanted to recompile it (the 
thought is that I could do a substring on the path, if it is indeed adding luceneweb/ as 
part of all the paths). (It's a bit of a hack, but it would work.)

Anyways, I downloaded javacc and am trying to build, to no avail.  I've read through 
the newsgroup archives, read the help files, and looked on the net... so here I am 
emailing the group.

thanks so much.

some more detail:
ant can't find javacc -  (also, it wants javacc.zip; but the javacc distrib. I got 
only comes with javacc.jar)

from my default.properties file (I added this myself):

# Home directory of JavaCC
javacc.home =   c:/Java tools/javacc-3.1/
javacc.zip.dir = ${javacc.home}/bin/lib
javacc.zip = ${javacc.zip.dir}/javacc.jar

(the above snippet seems to do no good :(

-Jill




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: javacc problem + path/link problem in html demo

2003-08-01 Thread Eric Isakson
JavaCC 2 is no longer available. You will have to upgrade or dig for it (i.e. you and 
whomever you get it from would be violating the license agreement).

Eric

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
Sent: Friday, August 01, 2003 2:40 PM
To: Lucene Users List
Subject: RE: javacc problem + path/link problem in html demo



I saw that bug, but tried it anyways..


but since the bug is still active, do you know where I can get an earlier copy of 
javacc? (and which version, exactly, I need?)

thanks

-Jill



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: JavaCC v3 and Lucene

2003-07-21 Thread Eric Isakson
if you go to the bug and right click and Save Target As... (in IE, not sure what the 
Netscape equivalent is) on the link for the attachment:

 06/14/03 01:23 javacc3-ant-support.jar to be added to /lib   
(application/octet-stream) 

then save it as javacc3-ant-support.jar into your Lucene /lib directory.

Then save this other attachment (it is a patch file).

 06/14/03 02:39 Complete Patch including refactoring the javacc tasks out of the 
compile target   (text/plain) 

then apply this patch. Not sure what tools you can use to do that; I use the Team 
support in Eclipse www.eclipse.org (Team->Apply Patch).

I noticed a day or two ago that the build.xml diff is a little bit out of synch with 
current CVS, so you may need to look at that some. I started fixing up a new patch but 
haven't gotten enough free time to fix it yet.

Eric

-Original Message-
From: Liliya Kharevych [mailto:[EMAIL PROTECTED] 
Sent: Monday, July 21, 2003 6:56 PM
To: [EMAIL PROTECTED]
Subject: RE: JavaCC v3 and Lucene



Hi,

I was trying to build Lucene with JavaCC 3.0 and completely got lost.

Sorry about the dummy question, but where can I download the patch?

I tried the bug URL, and was able to download JavaCC_3.java, but the last attachment 
is this big text file and I cannot figure out what to do with it.

As I understand it, build.xml should be changed and javacc3-ant-support.jar should be 
somewhere, but I cannot find it.

Thanks,
lily


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: CJK support in lucene

2003-07-16 Thread Eric Isakson
This archived message has the CJKTokenizer code attached (there are some links in the 
code to material that describes the tokenization strategy).

http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]msgId=330905

You have to write your own analyzer that uses this tokenizer. See 
http://www.onjava.com/pub/a/onjava/2003/01/15/lucene.html for some details on how to 
write an analyzer.

here is one you could use:
package my.package;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKTokenizer;
import java.io.Reader;

public class CJKAnalyzer extends Analyzer {

public CJKAnalyzer() {
}

/**
 * Creates a TokenStream which tokenizes all the text in the provided Reader.
 *
 * @return  A TokenStream built from a CJKTokenizer
 */
public TokenStream tokenStream( String fieldName, Reader reader )
{
TokenStream result = new CJKTokenizer( reader );
        // CJKTokenizer emits an empty ("") token sometimes; haven't been able to
        // figure out why, so this StopFilter is a workaround to remove it
        result = new StopFilter(result, new String[] {""});
return result;
}
}

Lastly, you have to package those things up and use them along with the core lucene 
code.

CC'ing this to Lucene User so everyone can benefit from these answers. Maybe a FAQ on 
indexing CJK languages would be a good thing to add. The existing one 
(http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.indexing&toc=faq#q28)
 is somewhat light on details (so is this answer, but it is a bit more direct about 
dealing with CJK), and http://www.jguru.com/faq/view.jsp?EID=108 is useful to be 
aware of too.

Good luck,
Eric

-Original Message-
From: Avnish Midha [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, July 16, 2003 1:06 PM
To: Eric Isakson
Subject: CJK support in lucene



Hi Eric,

I read the description of the bug (#18933) reported by you on the apache site. I had a 
question related to this defect. In the description you have mentioned that CJK 
support should be included in the core build. Is there any other way we can enable 
CJK support in the lucene search engine? I would be grateful if you could let me 
know of any such method of enabling CJK support in the search engine.

Eagerly waiting for your reply.

Thanks & Regards,
Avnish Midha
Phone no.: +1-949-8852540





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



FW: CJK support in lucene

2003-07-16 Thread Eric Isakson


-Original Message-
From: Eric Isakson 
Sent: Wednesday, July 16, 2003 2:04 PM
To: 'Avnish Midha'
Subject: RE: CJK support in lucene


I'm no linguist, so the short answer is: I'm not sure about Taiwanese. If it shares 
the same character sets and a bigram indexing approach makes sense for that language 
(read the links in the CJKTokenizer source), then it would probably work.

For Latin-1 languages, it will tokenize (it is set up to deal with mixed-language 
documents where some of the text might be Chinese and some might be English), but it 
will be far less efficient than the standard tokenizer supplied with the Lucene core. 
You should run your own tests to see if that would be livable.

Eric

-Original Message-
From: Avnish Midha [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, July 16, 2003 1:50 PM
To: Eric Isakson
Cc: Lucene Users List
Subject: RE: CJK support in lucene



Eric,

Does this tokenizer also support Taiwanese & European languages (Latin-1)?

Regards,
Avnish

-Original Message-
From: Eric Isakson [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 16, 2003 10:38 AM
To: Avnish Midha
Cc: Lucene Users List
Subject: RE: CJK support in lucene


This archived message has the CJKTokenizer code attached (there are some links in the 
code to material that describes the tokenization strategy).

http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]msgId=330905

You have to write your own analyzer that uses this tokenizer. See 
http://www.onjava.com/pub/a/onjava/2003/01/15/lucene.html for some details on how to 
write an analyzer.

here is one you could use:
package my.package;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKTokenizer;
import java.io.Reader;

public class CJKAnalyzer extends Analyzer {

public CJKAnalyzer() {
}

/**
 * Creates a TokenStream which tokenizes all the text in the provided Reader.
 *
 * @return  A TokenStream built from a CJKTokenizer
 */
public TokenStream tokenStream( String fieldName, Reader reader )
{
TokenStream result = new CJKTokenizer( reader );
        // CJKTokenizer emits an empty ("") token sometimes; haven't been able to
        // figure out why, so this StopFilter is a workaround to remove it
        result = new StopFilter(result, new String[] {""});
return result;
}
}

Lastly, you have to package those things up and use them along with the core lucene 
code.

CC'ing this to Lucene User so everyone can benefit from these answers. Maybe a FAQ on 
indexing CJK languages would be a good thing to add. The existing one 
(http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.indexing&toc=faq#q28)
 is somewhat light on details (so is this answer, but it is a bit more direct about 
dealing with CJK), and http://www.jguru.com/faq/view.jsp?EID=108 is useful to be 
aware of too.

Good luck,
Eric

-Original Message-
From: Avnish Midha [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, July 16, 2003 1:06 PM
To: Eric Isakson
Subject: CJK support in lucene



Hi Eric,

I read the description of the bug (#18933) reported by you on the apache site. I had a 
question related to this defect. In the description you have mentioned that CJK 
support should be included in the core build. Is there any other way we can enable 
CJK support in the lucene search engine? I would be grateful if you could let me 
know of any such method of enabling CJK support in the search engine.

Eagerly waiting for your reply.

Thanks & Regards,
Avnish Midha
Phone no.: +1-949-8852540




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: '-' character not interpreted correctly in field names

2003-07-09 Thread Eric Isakson
You left out the ~ character in your _FIELDNAME_START_CHAR production. That character 
tells the grammar that it should take all the characters except the ones you specified 
(the complement).

Change:

| <#_FIELDNAME_START_CHAR: ( [ " ", "\t", "+", "-", "!", "(", ")", ":",

To:

| <#_FIELDNAME_START_CHAR: ( ~[ " ", "\t", "+", "-", "!", "(", ")", ":",

and it should probably work.

Eric

-Original Message-
From: Victor Hadianto [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, July 09, 2003 4:53 AM
To: Lucene Users List
Subject: Re: '-' character not interpreted correctly in field names


Hi Erik and others,

I'm looking for a similar solution where I need QueryParser not to drop the 
'-' characters from the field name. However, outside the field I do want the '-' 
sign interpreted as a NOT modifier. 

I'm definitely not an expert in JavaCC and, to be honest, I only have a limited 
idea of how Erik's suggestion works.

Anyway I followed the suggestion and added the following:

| <#_WHITESPACE: ( " " | "\t" ) >
| <#_FIELDNAME_START_CHAR: ( [ " ", "\t", "+", "-", "!", "(", ")", ":", "^",
     "[", "]", "\"", "{", "}", "~", "*", "?" ]
  | <_ESCAPED_CHAR> ) >
| <#_FIELDNAME_CHAR: ( <_FIELDNAME_START_CHAR> | <_ESCAPED_CHAR> ) >

and again below I added:


| <TERM: <_TERM_START_CHAR> (<_TERM_CHAR>)* >
| <FIELDNAME: <_FIELDNAME_START_CHAR> (<_FIELDNAME_CHAR>)* >

And I changed:

LOOKAHEAD(2)
fieldToken=<TERM> <COLON> { field = fieldToken.image; }

to: ...

LOOKAHEAD(2)
fieldToken=<FIELDNAME> <COLON> { field = fieldToken.image; }


Well, after doing all these mods, every query that involved field names caused 
problems; for example, if I searched for

fieldname:hello

The query is blank (yes blank, nothing in it)

and if the fieldname does contain a dash (-) for example: field-name:hello

The query is: +field -name

hello is dropped.


Does anyone have any idea? Help and suggestions will be much appreciated. I 
really need to get this dash working; changing the field name will be my last 
resort, which I won't explore until I really have to.


Thanks,

Victor


On Thu, 15 May 2003 04:54 am, Eric Isakson wrote:
 I think the query parser changes would not be too bad, I've outlined a 
 couple of relevant lines you should look at so you don't have to try 
 and comprehend the productions for the entire QueryParser. I do not 
 think I would like to have to maintain one of those myself though. 
 Your other unmentioned alternative is to choose field names that match 
 the TERM production of QueryParser.jj without escapes.

 QueryParser.jj line 557:
 fieldToken=<TERM> <COLON> { field = fieldToken.image; }

 and earlier...
 <#_ESCAPED_CHAR: "\\" [ "\\", "+", "-", "!", "(", ")", ":", "^",
   "[", "]", "\"", "{", "}", "~", "*", "?" ] >

 | <#_TERM_START_CHAR: ( ~[ " ", "\t", "+", "-", "!", "(", ")", ":", "^",
     "[", "]", "\"", "{", "}", "~", "*", "?" ]
   | <_ESCAPED_CHAR> ) >

 | <#_TERM_CHAR: ( <_TERM_START_CHAR> | <_ESCAPED_CHAR> ) >

 ...

 <TERM: <_TERM_START_CHAR> (<_TERM_CHAR>)* >

 So the characters you need to avoid in your field names are the ones
 from _ESCAPED_CHAR: [ "\\", "+", "-", "!", "(", ")", ":", "^", "[",
 "]", "\"", "{", "}", "~", "*", "?" ]

 If you need to modify the parser, you will probably want to add a 
 FIELDNAME token and other supporting productions that look really 
 similar to these lines I've copied but modify the complement, ~[...], 
 at the beginning of _FIELDNAME_START_CHAR (you would add this 
 production) so it will match the - that you are using in your field 
 names (and fix it to match any other characters you want to use in 
 field names that it doesn't allow right now).

 Eric

 -Original Message-
 From: Jon Pipitone [mailto:[EMAIL PROTECTED]
 Sent: Wednesday, May 14, 2003 2:26 PM
 To: Lucene Users List
 Subject: Re: '-' character not interpreted correctly in field names

 Eric Isakson wrote:
  I just looked at the QueryParser.jj code, your field names
 
   never get processed by the analyzer. It does look like the   query 
 parser will honor escapes though. I haven't tried   this, but try a 
 query like foo\-bar:foo and have
 
  a look at the QueryParser.jj file for how it handles field
 
   names when parsing your query.

 Hrm.. that's what I had found too.  So, you're saying that, other than 
 escaping dashes, I'd have to change QueryParser.. ?

 I'm not too familiar just yet with JavaCC syntax, so reading through 
 QueryParser is a little tough going.  Thanks Eric,

 jp

  -Original Message-
  From: Jon Pipitone [mailto:[EMAIL PROTECTED]
  Sent: Monday, May 12, 2003 4:03 PM
  To: Lucene Users List
  Subject: Re: '-' character not interpreted correctly in field names
 
 
  Hi Otis, Terry,
 
   You can write a custom Analyzer that does not remove dashes from
   tokens, and use it for both indexing and searching. This is a
   frequent question and answer on this list.
 
  Sorry for the noise, but I haven't been able to find a solution in 
  the mailing list archives, or by writing my own analyzer:
 
  public class MyAnalyzer

RE: Querying Question

2003-04-03 Thread Eric Isakson
This query.toLowerCase() lowercased your query to become:

name:\"checkpoint\" and  value:\"filenane_1\"

The keyword AND must be uppercase when the query parser gets a hold of it.

If your RepositoryIndexAnalyzer lowercases its tokens you don't need to do 
query.toLowerCase(). If it doesn't lowercase its tokens, you may want to modify it so 
that it does.
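
A hedged sketch of that (RepositoryIndexAnalyzer is your class; this just wraps its 
token stream):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;

public class LowercasingRepositoryAnalyzer extends Analyzer {
    private final Analyzer delegate = new RepositoryIndexAnalyzer();

    public TokenStream tokenStream(String fieldName, Reader reader) {
        // lowercase tokens during analysis instead of lowercasing the raw
        // query string, so the AND/OR/NOT keywords keep their case
        return new LowerCaseFilter(delegate.tokenStream(fieldName, reader));
    }
}

You would then drop the query.toLowerCase() call (and make sure the index terms are 
lowercase too, e.g. by indexing with this same analyzer).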

Eric

-Original Message-
From: Rob Outar [mailto:[EMAIL PROTECTED] 
Sent: Thursday, April 03, 2003 5:11 PM
To: Lucene Users List
Subject: Querying Question
Importance: High


Hi all,

I am a little fuzzy on complex querying using AND, OR, etc..  For example:

I have the following name/value pairs

file 1 = name = checkpoint value = filename_1
file 2 = name = checkpoint value = filename_2
file 3 = name = checkpoint value = filename_3
file 4 = name = checkpoint value = filename_4

I ran the following Query:

name:\"checkpoint\" AND  value:\"filenane_1\"

Instead of getting back file 1, I got back all four files?

Then after trying different things I did:

+(name:\"checkpoint\") AND  +(value:\"filenane_1\")

it then returned file 1.

Our project queries solely on name/value pairs and we need the ability to query using 
AND, OR, NOTs, etc.  What is the correct syntax for such queries?

The code I use is :
 QueryParser p = new QueryParser("",
 new RepositoryIndexAnalyzer());
 this.query = p.parse(query.toLowerCase());
 Hits hits = this.searcher.search(this.query);

Thanks as always,

Rob



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Indexing and searching non-latin languages using utf-8

2003-03-18 Thread Eric Isakson
Have you verified that your form inputs are getting to your query objects without the 
String being mangled due to encoding problems?

I'm getting Japanese in UTF-8 and use the technique described at 
http://w6.metronet.com/~wjm/tomcat/2001/Aug/msg00230.html to get the data from the 
browser to Lucene. I build my index using the HTMLParser in the Lucene demos and give 
it a Reader object created from an InputStreamReader that specifies the 
HTML file's encoding (Shift_JIS in my case).
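
The indexing side of that, roughly (the file name and encoding here are just examples):

import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.Reader;

// decode the file's bytes with its known encoding before Lucene sees them
Reader reader = new InputStreamReader(
        new FileInputStream("doc.html"), "Shift_JIS");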

There are a bunch of other issues I'm working on to support Japanese, but I'm getting 
search results at this point.

The two places that encodings should come into play for you are parsing your source 
content into the Reader or String that you use to create 
org.apache.lucene.document.Field objects and getting the user query from their browser 
to the Query objects.

Eric
--
Eric D. IsaksonSAS Institute Inc.
Application Developer  SAS Campus Drive
XML Technologies   Cary, NC 27513
(919) 531-3639 http://www.sas.com



-Original Message-
From: MERCIER ALEXANDRE [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, March 18, 2003 11:36 AM
To: [EMAIL PROTECTED]
Subject: Indexing and searching non-latin languages using utf-8


Hi all,

I have a problem with indexing and then searching docs written in non-Latin languages 
and encoded in UTF-8 (Russian, for example).

I have a web application with a simple form to search the contents of the docs. 
When I submit the form, I encode the query term in UTF-8 with
encodeURI(String), but I match no doc. I think that is due to bad indexing, but I'm 
not sure.

Lucene normally indexes docs by writing Terms to the 'xxx.tis' file, encoding them 
in UTF-8, I believe. So when it reads the file, it correctly gets Russian characters 
(2 bytes), but when writing them to the index they seem different (I've listed the 
terms in my application console).

If someone has a solution to my problem, all advice is welcome.

Thanks.
Alex


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Indexing and searching non-latin languages using utf-8

2003-03-18 Thread Eric Isakson
There are a bunch of other issues... I should have qualified that. There really aren't 
any issues with the Lucene core to support Japanese, just other issues in my app that 
uses Lucene and working with my content providers to ensure consistent use of 
encodings, etc.

I have found what I think is a bug in the CJKTokenizer in that it emits an empty 
string token after processing my Japanese characters. I haven't found the bug in 
CJKTokenizer yet, but as a workaround I'm using a StopFilter that removes it.

-Original Message-
From: Eric Isakson [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, March 18, 2003 11:52 AM
To: Lucene Users List
Subject: RE: Indexing and searching non-latin languages using utf-8


Have you verified that your form inputs are getting to your query objects without the 
String being mangled due to encoding problems?

I'm getting Japanese in UTF-8 and use the technique described at 
http://w6.metronet.com/~wjm/tomcat/2001/Aug/msg00230.html to get the data from the 
browser to Lucene. I build my index using the HTMLParser in the Lucene demos and give 
it a Reader object created from an InputStreamReader that specifies the 
HTML file's encoding (Shift_JIS in my case).

There are a bunch of other issues I'm working on to support Japanese, but I'm getting 
search results at this point.

The two places that encodings should come into play for you are parsing your source 
content into the Reader or String that you use to create 
org.apache.lucene.document.Field objects and getting the user query from their browser 
to the Query objects.

Eric
--
Eric D. IsaksonSAS Institute Inc.
Application Developer  SAS Campus Drive
XML Technologies   Cary, NC 27513
(919) 531-3639 http://www.sas.com



-Original Message-
From: MERCIER ALEXANDRE [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, March 18, 2003 11:36 AM
To: [EMAIL PROTECTED]
Subject: Indexing and searching non-latin languages using utf-8


Hi all,

I have a problem with indexing and then searching docs written in non-Latin languages 
and encoded in UTF-8 (Russian, for example).

I have a web application with a simple form to search the contents of the docs. 
When I submit the form, I encode the query term in UTF-8 with
encodeURI(String), but I match no doc. I think that is due to bad indexing, but I'm 
not sure.

Lucene normally indexes docs by writing Terms to the 'xxx.tis' file, encoding them 
in UTF-8, I believe. So when it reads the file, it correctly gets Russian characters 
(2 bytes), but when writing them to the index they seem different (I've listed the 
terms in my application console).

If someone has a solution to my problem, all advice is welcome.

Thanks.
Alex


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Multi Language support

2003-03-06 Thread Eric Isakson
Hi Günter,

I had a similar requirement for my use of Lucene. We have documents with mixed 
languages, some of the text in the user's native language and some in English. We made 
the decision not to use any of the stemming analyzers and to index with no stop words 
(I didn't like the no-stop-words decision, but it wasn't really my call). My 
analyzer's tokenStream method:

public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = new StandardTokenizer(reader);
    result = new StandardFilter(result);   // standard normalization (possessives, acronym dots)
    result = new LowerCaseFilter(result);  // lowercase only; no stemming, no stop words
    return result;
}

Do you really need stemming in your application? Do you really need stop words?

See this note http://archives.apache.org/eyebrowse/[EMAIL PROTECTED]msgId=653731 for 
a discussion about the advantages/disadvantages of stemming.

If you still want stop words, you can create a list that includes words from more than 
one language, then use the same analyzer for all of your content.
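
A small sketch of that (the word lists here are abbreviated examples; extend them for 
real use):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class MultiLanguageAnalyzer extends Analyzer {
    // one merged stop list covering both languages (abbreviated here)
    private static final String[] STOP_WORDS = {
        "a", "and", "the",          // English
        "der", "die", "das", "und"  // German
    };

    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream result = new StandardTokenizer(reader);
        result = new StandardFilter(result);
        result = new LowerCaseFilter(result);
        result = new StopFilter(result, STOP_WORDS);
        return result;
    }
}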

If you still need stemming, you will probably have to give your user the ability to 
tell you which language index they wish to search and you would probably be better off 
maintaining separate indices for each language at that point.

Best of luck,
Eric


-Original Message-
From: Günter Kukies [mailto:[EMAIL PROTECTED] 
Sent: Thursday, March 06, 2003 2:08 AM
To: Lucene Users List
Subject: Multi Language support


Hello,

that is what I know about indexing international documents:

1. I have a language ID
2. with this ID I choose a special Analyzer for that language 
3. I can use one index for all languages

But what about searching for international documents?

I don't have a language ID, because the user is interested in documents in his 
native language and in a second language, mostly English. So, what Analyzer do I use 
for searching?


Thanks

Günter

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Phrase query and porter stemmer

2003-02-13 Thread Eric Isakson
Ramesh,

I haven't examined the code closely that does this positioning, but this is how I 
believe it works:

Let's say you had a token stream that returned the tokens "you", "are", "running", 
"faster", "than", "me" and that didn't do any setPositionIncrement calls. The default 
increment is 1.

Each token in the stream gets a position that allows you to do things like proximity 
searches: the query "are than"~3 would find the document that token stream came from, 
since "are" occurs at position 2 and "than" at position 5, and 5-2 = 3.

Now let's say you wanted to stem "running" to "run" but keep the original token. You 
would create a token filter that inserted the stem "run" into the token stream when 
the "running" token occurred but also kept the original token "running". If you didn't 
set the position increment on the inserted token, then the distance between "are" and 
"than" would become 6-2 = 4, which is greater than 3, and your proximity query would 
fail.

When you set the position increment to zero for the added token, it gets treated as if 
it is at the same position as the original token, which prevents you from breaking your 
proximity query.
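
A minimal sketch of such a filter (entirely hypothetical, not from Lucene; stem() is a 
stand-in for a real stemmer such as the Porter stemmer):

import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class KeepOriginalStemFilter extends TokenFilter {
    private Token pendingStem; // stem queued to follow the original token

    public KeepOriginalStemFilter(TokenStream in) {
        input = in; // TokenFilter's protected input field
    }

    public Token next() throws IOException {
        if (pendingStem != null) {
            Token t = pendingStem;
            pendingStem = null;
            return t;
        }
        Token orig = input.next();
        if (orig == null) return null;
        String stem = stem(orig.termText());
        if (!stem.equals(orig.termText())) {
            Token s = new Token(stem, orig.startOffset(), orig.endOffset());
            s.setPositionIncrement(0); // same position as the original token
            pendingStem = s;
        }
        return orig;
    }

    private static String stem(String term) {
        // placeholder only: plug in a real stemmer here
        return term.endsWith("ing") ? term.substring(0, term.length() - 3) : term;
    }
}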

Proximity queries are the one place I know this affects; I'm unsure how the positions 
affect other parts of Lucene.

Hope I got all that right and that it helps you understand the setPositionIncrement.

Eric

-Original Message-
From: Mailing Lists Account [mailto:[EMAIL PROTECTED]]
Sent: Thursday, February 13, 2003 7:07 AM
To: Lucene Users List
Subject: Re: Phrase query and porter stemmer


Hi Eric,

Thanks for the reply.  The option of a custom token filter sounds good to
me. I am not sure what the advantage of the Token.setPositionIncrement()
option is. Let me look into the docs before I ask further questions on
this.

regards
Ramesh

Eric Isakson wrote:
 You won't get hits for "security" if you do not use the stemmer. The
 stem of "security" is the token that gets stored in the index.

 If you don't use the stemming algorithm when you create the index you
 could search for "security" and only get those documents that contain
 "security".

 See the FAQ

 http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.indexing&toc=faq#q15

 If you have a list of terms you want to treat differently (i.e. you
 know there are certain words you don't want to stem) you could build
 a custom TokenFilter that checks the tokens for those words before
 applying the stemming algorithm then add that TokenFilter to your
 analyzer. You might also consider allowing the tokens to be stemmed
 and adding the original non-stemmed term at the same position using
 Token.setPositionIncrement(0), you might also want to figure out some
 way to boost the score on those non-stemmed tokens when you build
 your query (not sure how you might accomplish that, but some custom
 query parsing code could do the trick).

 Eric

 -Original Message-
 From: Mailing Lists Account [mailto:[EMAIL PROTECTED]]
 Sent: Wednesday, February 12, 2003 4:17 AM
 To: [EMAIL PROTECTED]
 Subject: Phrase query and porter stemmer


 Hi,

 I use PorterStemmer with my analyzer for indexing the documents.
 And I have been using the same analyzer for searching too.

 When I search for a phrase like "security AND database", I would like
 to avoid matches for terms like "secure" or "securities".  I observed
 that Google and a couple of search engines do not return such matches.

 1) In other words, in a single query, is it possible not to choose the
 porter stemmer for phrase queries and use it for other queries (such as
 Term query etc)?

 2) As an alternative, is it advisable to manually construct a
 PhraseQuery by adding terms without applying the porter stemmer?

 regards
 Ramesh



 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




RE: Phrase query and porter stemmer

2003-02-12 Thread Eric Isakson
You won't get hits for "security" if you do not use the stemmer. The stem of 
"security" is the token that gets stored in the index.

If you don't use the stemming algorithm when you create the index, you could search for 
"security" and only get those documents that contain "security".

See the FAQ 
http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.indexing&toc=faq#q15

If you have a list of terms you want to treat differently (i.e. you know there are 
certain words you don't want to stem) you could build a custom TokenFilter that checks 
the tokens for those words before applying the stemming algorithm then add that 
TokenFilter to your analyzer. You might also consider allowing the tokens to be 
stemmed and adding the original non-stemmed term at the same position using 
Token.setPositionIncrement(0), you might also want to figure out some way to boost the 
score on those non-stemmed tokens when you build your query (not sure how you might 
accomplish that, but some custom query parsing code could do the trick).

Eric

-Original Message-
From: Mailing Lists Account [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, February 12, 2003 4:17 AM
To: [EMAIL PROTECTED]
Subject: Phrase query and porter stemmer


Hi,

I use PorterStemmer with my analyzer for indexing the documents.
And I have been using the same analyzer for searching too.

When I search for a phrase like "security AND database", I would like to
avoid matches for terms like "secure" or "securities".  I observed that
Google and a couple of search engines do not return such matches.

1) In other words, in a single query, is it possible not to choose the porter
stemmer for phrase queries and use it for other queries (such as Term query
etc)?

2) As an alternative, is it advisable to manually construct a PhraseQuery by
adding terms without applying the porter stemmer?

regards
Ramesh



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




RE: Is the searched string 'on' a special case ?

2003-01-13 Thread Eric Isakson
Assuming you are using StandardAnalyzer, the default stop words are:

public static final String[] STOP_WORDS = {
    "a", "and", "are", "as", "at", "be", "but", "by",
    "for", "if", "in", "into", "is", "it",
    "no", "not", "of", "on", "or", "s", "such",
    "t", "that", "the", "their", "then", "there", "these",
    "they", "this", "to", "was", "will", "with"
};

Your state field must not be built with StandardAnalyzer, or "ON" would have been 
removed by the analyzer when you created the field. It looks like you will need to use 
lower-level APIs than QueryParser to create your Query object, or not use the default 
stop words.
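
A hedged sketch of the lower-level route (the class wrapper is mine; companyQuery 
stands in for however you build the rest of the query):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class StateQueryExample {
    public static Query build(Query companyQuery) {
        BooleanQuery q = new BooleanQuery();
        // build the state clause directly so the stop word "on" survives
        q.add(new TermQuery(new Term("state", "on")), true, false); // required
        q.add(companyQuery, true, false); // e.g. the company:inc~100 part
        return q;
    }
}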

Eric

-Original Message-
From: Alain Lauzon [mailto:[EMAIL PROTECTED]]
Sent: Monday, January 13, 2003 1:23 PM
To: Lucene Users List
Subject: Is the searched string 'on' a special case ?


I have an index with many fields, and especially one for company name and 
one for state.

When I search for :
+company:inc~100

I get 114 results from 2 states, HI (Hawaii) and ON (Ontario).


If I search for :
+state:hi +company:inc~100

I get 7 results for Hawaii.


But when I search for:
+state:on +company:inc~100

I get no results at all for Ontario.

So what is going on? I tried with many other states and all are working, 
but not 'on'.
Is 'on' a special case? Like on/off?

Alain Lauzon
[EMAIL PROTECTED]



--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




RE: Query grouping

2003-01-06 Thread Eric Isakson
Abhay,

This query is processed by the query parser...

((+SEARCH_NAME:dvd +SEARCH_NAME:cd ) OR (+DEF_DOC_FIELD:dvd 
+DEF_DOC_FIELD:cd )) AND ((-SEARCH_NAME:player) OR (-DEF_DOC_F
IELD:player))

and comes out looking like...
+((+SEARCH_NAME:dvd +SEARCH_NAME:cd) (+DEF_DOC_FIELD:dvd +DEF_DOC_FIELD:cd)) 
++((-SEARCH_NAME:player) (-DEF_DOC_FIELD:player))
Using org.apache.lucene.search.Query.toString(String fieldName)

I use this representation as it shows me what happened after my query was processed by 
the QueryParser and Analyzer, so stop words would be removed and case modified if the 
analyzer does such things.


This part...
+((+SEARCH_NAME:dvd +SEARCH_NAME:cd) (+DEF_DOC_FIELD:dvd +DEF_DOC_FIELD:cd))
will produce a set of documents as hits that have the dvd and cd terms in those 
fields

This part...
+((-SEARCH_NAME:player) (-DEF_DOC_FIELD:player))
will always produce an empty set

when the two sets are joined with an intersection, you will always get an empty set

The problem is that the NOT or - operator excludes documents from the set of found 
documents, not from the set of all documents. This is correct Lucene behavior. So, 
since there are no found documents in that required part of the query, your results 
will always be no hits.

This is mentioned in the jGuru FAQ at http://www.jguru.com/faq/view.jsp?EID=593598

Rearranging the query the way you mentioned is the correct way to deal with this.

Eric

-Original Message-
From: Abhay Saswade [mailto:[EMAIL PROTECTED]]
Sent: Friday, January 03, 2003 9:07 PM
To: [EMAIL PROTECTED]
Subject: Re: Query grouping
...

However, when I try to do this in a single query by grouping I get no result
((+SEARCH_NAME:dvd +SEARCH_NAME:cd ) OR (+DEF_DOC_FIELD:dvd 
+DEF_DOC_FIELD:cd )) AND ((-SEARCH_NAME:player) OR (-DEF_DOC_F
IELD:player))

I don't get any results on a single term query like this (and this explains 
why I am not getting any results in the above query):
-SEARCH_NAME:player
Is this a known issue?

Is there any way of dealing with above-mentioned problem other than 
rearranging query like this?
(+SEARCH_NAME:dvd +SEARCH_NAME:cd -SEARCH_NAME:player) OR 
(+DEF_DOC_FIELD:dvd +DEF_DOC_FIELD:cd -DEF_DOC_FIELD:player)

Thanks
Abhay






From: Otis Gospodnetic [EMAIL PROTECTED]
Reply-To: Lucene Users List [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Subject: Re: Query grouping
Date: Fri, 3 Jan 2003 12:33:53 -0800 (PST)

Does the '... AND -...' part even make sense?
Why not just '-...'?
Also, 'AND +' doesn't make sense, does it?
+field:term means the term has to be in the result, so AND is not
really needed, is it?
I am not sure if spaces after 'SEARCH_NAME:' make a difference or not.

Also, field:term1 field:term2 implies term1 OR term2, so no need for OR
there, especially with +, I think.

Otis


--- Abhay Saswade [EMAIL PROTECTED] wrote:
  I am using lucene release 1.2. I am using StandardAnalyzer. Has
  anybody
  faced this problem?
 
  I get same results when I run following queries
 
  1. (+SEARCH_NAME:jhon  +SEARCH_NAME:joy)  AND -SEARCH_NAME:chan
  2. (+SEARCH_NAME:jhon  AND +SEARCH_NAME: joy)  AND -SEARCH_NAME:chan
  3. (+SEARCH_NAME:jhon  OR +SEARCH_NAME: joy)  AND -SEARCH_NAME:chan
 
  But when I regroup the query by putting brackets around the last term
  like
  mentioned below I don't get any results
 
  1. (+SEARCH_NAME:jhon +SEARCH_NAME: joy)  AND (-SEARCH_NAME:chan)
  2. (+SEARCH_NAME:jhon AND +SEARCH_NAME: joy)  AND (-SEARCH_NAME:chan)
  3. (+SEARCH_NAME:jhon OR +SEARCH_NAME: joy)  AND (-SEARCH_NAME:chan)
 
  This is just an example. I need to do grouping on various fields. Am
  I
  missing something?  Is there any document other than
  http://jakarta.apache.org/lucene/docs/queryparsersyntax.html? Can
  somebody
  throw some light on this?
 
  Thanks,
  Abhay
 
 
 
 
 
 
 
 
  --
  To unsubscribe, e-mail:
  mailto:[EMAIL PROTECTED]
  For additional commands, e-mail:
  mailto:[EMAIL PROTECTED]
 



--
To unsubscribe, e-mail:   
mailto:[EMAIL PROTECTED]
For additional commands, e-mail: 
mailto:[EMAIL PROTECTED]




--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




RE: package information?

2002-12-20 Thread Eric Isakson
I think this info is available via the Manifest that is created during the build. This 
is cut from the build.xml from the latest CVS...

!-- Create Jar MANIFEST file --
echo file=${build.manifest}Manifest-Version: 1.0
Created-By: Apache Jakarta

Name: org/apache/lucene
Specification-Title: Lucene Search Engine
Specification-Version: ${version}
Specification-Vendor: Lucene
Implementation-Title: org.apache.lucene
Implementation-Version: build ${DSTAMP} ${TSTAMP}
Implementation-Vendor: Lucene
/echo

This is only added to the core jar, there is no such Manifest generated for the demo 
jar.
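
For completeness, a rough sketch of reading those attributes back at runtime via 
java.lang.Package (this only works once a class from the jar has been loaded):

Package p = Package.getPackage("org.apache.lucene");
if (p != null) {
    System.out.println(p.getSpecificationTitle() + " "
            + p.getSpecificationVersion());
    System.out.println("build: " + p.getImplementationVersion());
}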

Eric

-Original Message-
From: petite_abeille [mailto:[EMAIL PROTECTED]]
Sent: Friday, December 20, 2002 3:04 PM
To: [EMAIL PROTECTED]
Subject: package information?


Hi,

Would it be possible for Lucene to provide package information? 
Basically all the java.lang.Package attributes... things like 
implementation vendor, name, version and so on. This would make it 
easier to identify which packages/versions are used.

Thanks.

PA.


--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




RE: help w/ phrase query

2002-12-13 Thread Eric Isakson
Dominic,

Are you constructing the PhraseQuery directly, using its add(Term) method to add terms 
to the query? If so, you need to make sure your terms go through the same 
normalization (via the Analyzer) that your content went through when you created your 
index.

So if the field you are querying was created in your index using StandardAnalyzer, the 
terms in your query should also be run through StandardAnalyzer.

Does this help? If not, give us a little more detail about what Analyzer you are using 
to create your index and how you are creating your PhraseQuery object.
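
A minimal sketch of doing that (hedged; pass in whatever Analyzer built your index, 
and the field name is an example):

import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;

public class PhraseBuilder {
    public static PhraseQuery build(Analyzer analyzer, String field, String text)
            throws IOException {
        PhraseQuery pq = new PhraseQuery();
        // normalize each term exactly as it was normalized at index time
        TokenStream ts = analyzer.tokenStream(field, new StringReader(text));
        for (Token t = ts.next(); t != null; t = ts.next()) {
            pq.add(new Term(field, t.termText()));
        }
        return pq;
    }
}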

Eric

-Original Message-
From: host unknown [mailto:[EMAIL PROTECTED]]
Sent: Friday, December 13, 2002 1:17 PM
To: [EMAIL PROTECTED]
Subject: help w/ phrase query


Hi All.

I'm out of ideas on how to get the PhraseQuery to return any results.  I'm 
guessing I might not be indexing properly when the document data is being 
stored.  Is there any particular Field type that should be used?  I've tried 
both Field.Text(String, String) and Field.Text(String, Reader).

If Field type is irrelevant, any pointers on where to look next are 
appreciated.

Dominic
madison.com






--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




RE: Accentuated characters

2002-12-10 Thread Eric Isakson
Don't know if any of the code in this French analyzer contributed by Patrick 
Talbot may apply; any reason you don't just use it? See 
http://nagoya.apache.org/eyebrowse/ReadMsg?[EMAIL PROTECTED]msgNo=870

Eric
--
Eric D. IsaksonSAS Institute Inc.
Application Developer  SAS Campus Drive
XML Technologies   Cary, NC 27513
(919) 531-3639 http://www.sas.com


-Original Message-
From: stephane vaucher [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, December 10, 2002 2:58 PM
To: [EMAIL PROTECTED]
Subject: Accentuated characters


Hello everyone,

I wish to implement a TokenFilter that will remove accentuated 
characters so for example 'é' will become 'e'. As I would rather not 
reinvent the wheel, I've tried to find something on the web and on the 
mailing lists. I saw a mention of a contrib that could do this (see 
http://www.mail-archive.com/lucene-user%40jakarta.apache.org/msg02146.html), 
but I don't see anything applicable.

Has anyone done this yet? If so, I would much appreciate some pointers 
(or code); otherwise, I'll be happy to contribute whatever I produce 
(but it might be very simple since I'll only need to deal with French).

Cheers,
Stephane


--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




RE: Accentuated characters

2002-12-10 Thread Eric Isakson
If you really want to make your own TokenFilter, have a look at 
org.apache.lucene.analysis.LowerCaseFilter.next()

it does:
  public final Token next() throws java.io.IOException {
Token t = input.next();

if (t == null)
  return null;

t.termText = t.termText.toLowerCase();

return t;
  }

The termText member of the Token class is package-scoped, so you will have to 
implement your filter in the org.apache.lucene.analysis package. No worries about 
encoding, as termText is already a Java (Unicode) string. You will just have to 
provide the mechanism to convert the accented characters to their non-accented 
equivalents. java.text.Collator has some magic that does this for string comparisons, 
but I couldn't find any public methods that give you access to convert a string to its 
non-accented equivalent.
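
For what it's worth, a hypothetical sketch along those lines (the character table is 
abbreviated; it lives in org.apache.lucene.analysis so it can touch termText, per the 
note above):

package org.apache.lucene.analysis;

import java.io.IOException;

public final class AccentFilter extends TokenFilter {
    public AccentFilter(TokenStream in) {
        input = in; // TokenFilter's protected input field
    }

    public final Token next() throws IOException {
        Token t = input.next();
        if (t == null) return null;
        t.termText = removeAccents(t.termText);
        return t;
    }

    private static String removeAccents(String s) {
        StringBuffer sb = new StringBuffer(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case 'à': case 'â': case 'ä': sb.append('a'); break;
                case 'é': case 'è': case 'ê': case 'ë': sb.append('e'); break;
                case 'î': case 'ï': sb.append('i'); break;
                case 'ô': case 'ö': sb.append('o'); break;
                case 'ù': case 'û': case 'ü': sb.append('u'); break;
                case 'ç': sb.append('c'); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }
}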

Eric
--
Eric D. IsaksonSAS Institute Inc.
Application Developer  SAS Campus Drive
XML Technologies   Cary, NC 27513
(919) 531-3639 http://www.sas.com



-Original Message-
From: stephane vaucher [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, December 10, 2002 2:58 PM
To: [EMAIL PROTECTED]
Subject: Accentuated characters


Hello everyone,

I wish to implement a TokenFilter that will remove accentuated 
characters so for example 'é' will become 'e'. As I would rather not 
reinvent the wheel, I've tried to find something on the web and on the 
mailing lists. I saw a mention of a contrib that could do this (see 
http://www.mail-archive.com/lucene-user%40jakarta.apache.org/msg02146.html), 
but I don't see anything applicable.

Has anyone done this yet? If so, I would much appreciate some pointers 
(or code); otherwise, I'll be happy to contribute whatever I produce 
(but it might be very simple since I'll only need to deal with French).

Cheers,
Stephane


--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]