Re: Search for documents where field does not exist?

2005-06-20 Thread John Haxby

Erik Hatcher wrote:



On Jun 17, 2005, at 5:54 PM, [EMAIL PROTECTED] wrote:

Please do not reply to a post on the list if your message actually isn't a
reply. Post a new message instead.



Sorry about that... it wasn't intentional. I clicked reply to get the reply
address and then forgot to change the subject :)



Even changing the subject after doing a "reply" is not sufficient, as
it will still end up in the same thread erroneously.  You need to
create a new message to start a new thread.  (Certainly this varies
by mail client, though.)


The reason for this is that when you reply to a message, your client adds an
"In-Reply-To" header that refers to the message-id of the original
message; you may also get a "References" header that performs a similar
function for the other messages in the thread.  I haven't come across a mail
client that drops the In-Reply-To header when you change the subject.


If you want to find out exactly what your particular mail client does,
reply to a message of your own and then look at the message source or
the original message headers (what you're looking for depends on the
client; sometimes it's obvious, sometimes it's hidden in the message properties).
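
For example, a reply typically carries headers along these lines (the
message-ids here are invented purely for illustration):

    Message-ID:  <reply-5678@example.com>
    In-Reply-To: <original-1234@example.com>
    References:  <original-1234@example.com>

A threading mail client groups messages by following those In-Reply-To and
References identifiers, regardless of what the Subject line says.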


jch




Re: Problems with concurrency using Tomcat

2005-06-20 Thread Fred Vos
On Sun, Jun 19, 2005 at 07:05:06PM -0400, Joseph MarkAnthony wrote:
> 
> I have been struggling with this problem for months and have made no
> progress.
> I have created a simple web app for my Lucene indices that allows the user to
> rebuild or update the index.
> In the *same* web app (perhaps this is important), I have the search module.
> 
> The update module works fine - until someone tries a search.  Then the
> update will no longer be allowed to access the index directory.  The error
> is something like "cannot delete _a.cfs" or something similar.
> 
> I'm running Tomcat on Win2000, and I've pretty much confirmed that it's a
> WINDOWS locking problem, not a Lucene problem per se.  This is the windows
> "Cannot delete, File may be in use" situation that you often get when
> someone else is using a file and you try to delete it.  You see this
> situation many times outside of Lucene.
> 
> Does anyone know how to solve this?  It seems to be a Tomcat problem because
> I cannot duplicate the error in a standalone Java app.  I've seen a few
> posts on this and the resolution is often to copy the index and then update,
> then copy it back.  This can't be the only way to fix this.

Have you tried adding the -DdisableLuceneLocks=true flag to the RUNJAVA
command in catalina.sh? I had to add it to run servlets accessing Lucene,
though I'm not sure exactly what my problem was.

Lines 219-227 of my catalina.sh (running Jakarta Tomcat 5.0.16 under Linux):

  else
"$_RUNJAVA" $JAVA_OPTS $CATALINA_OPTS \
  -Djava.endorsed.dirs="$JAVA_ENDORSED_DIRS" -classpath "$CLASSPATH" \
  -Dcatalina.base="$CATALINA_BASE" \
  -Dcatalina.home="$CATALINA_HOME" \
  -Djava.io.tmpdir="$CATALINA_TMPDIR" \
  -DdisableLuceneLocks=true \
  org.apache.catalina.startup.Bootstrap "$@" start \
  >> "$CATALINA_BASE"/logs/catalina.out 2>&1 &

Fred




tried to access field org.apache.lucene.analysis.Token.termText from class org.apache.lucene.analysis.StopFilter

2005-06-20 Thread Xavier Lembo

Hi list,

I'm facing an exception that I don't understand when trying to index a
source with the StandardAnalyzer.

Everything works fine when I use my French AnalyserFr(), but the exception
is raised when using the StandardAnalyzer.

Can somebody help me find the cause of this error? I really have no idea
where to start looking.

Thanks

Xavier



here is the exception:

java.lang.IllegalAccessError: tried to access field org.apache.lucene.analysis.Token.termText from class org.apache.lucene.analysis.StopFilter
    at org.apache.lucene.analysis.StopFilter.next(StopFilter.java:99)
    at org.apache.lucene.index.DocumentWriter.invertDocument(DocumentWriter.java:155)
    at org.apache.lucene.index.DocumentWriter.addDocument(DocumentWriter.java:84)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:410)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:396)
    at org.elix.indexer.VeilleIndexer.newIndexVeille(Unknown Source)

and here is the code I use to index my document:


if (langage.equals("fr")) {
    analyzer = new AnalyserFr();
    logger.debug("Indexing the veille using the FR stemmers");
} else {
    analyzer = new StandardAnalyzer();
    logger.debug("Indexing the veille using the standard EN stemmer");
}

IndexWriter writer = null;
try {
    writer = new IndexWriter(indexPath, analyzer, createIndex);
    logger.info("Adding veille: " + rs.getString("titre"));

    // Fetch the "sourcecontent" field of the incoming document to boost it
    Field content = doc.getField("sourcecontent");
    content.setBoost(bcontent);

    // Add the columns we use to the Lucene document, building the fields
    // separately so that each one can be given its own boost
    Field ftitre = Field.Text("titre", titre);
    ftitre.setBoost(btitre);
    Field fresume = Field.Text("resume", resume);
    fresume.setBoost(bresume);
    Field fdetail = Field.Text("detail", detail);

    Field fkeywords = Field.Keyword("keywords", keywords);
    fkeywords.setBoost(bkeywords);

    doc.add(Field.Keyword("veille_id", id));
    doc.add(Field.Keyword("theme_id", theme_id));
    doc.add(Field.Keyword("themeLeft", Functions.padString(themeLeft, "0", 6)));
    doc.add(Field.Keyword("secteur_id", secteur_id));
    doc.add(Field.Keyword("secteurLeft", Functions.padString(secteurLeft, "0", 6)));
    doc.add(fkeywords);
    doc.add(Field.Keyword("indexDate", new Date()));
    doc.add(ftitre);
    doc.add(fresume);
    doc.add(fdetail);

    // Add a catch-all field containing all the text (concatenating the
    // string values, not the Field objects, which would only append
    // their toString() output)
    doc.add(Field.UnStored("allcontent", keywords + " " + titre + " " +
        resume + " " + detail + " " + content.stringValue()));

    // Add our document to the index
    writer.addDocument(doc);
    writer.optimize();
} catch (Exception ee) {
    throw ee;
} finally {
    // Close once, and only if the writer was actually opened
    if (writer != null) {
        writer.close();
    }
}
--





RAMDirectory without positions or frequencies?

2005-06-20 Thread eks dev
Hi,
I need a minimum memory footprint for the index during search (I would
like to hold it in RAM). The good thing in this story: similarity
calculation is not necessary, a pure boolean model is OK.
I am sure I have seen an explanation from Doug somewhere about disabling
norms... but cannot find it now.
Could I implement my own Similarity where idf, tf, coord and lengthNorm
all return a constant value? Will this disable loading of
norms/positions/freqs into memory during search? Also, if I use
RAMDirectory in this case (loading from FSDirectory), will these
elements be loaded into RAM?

Better expressed: if one needs only the boolean model (no score
calculation needed), what should I do to minimize memory usage and
optimize speed?
My collection has a lot of very short documents (a couple of tokens per
field), 5-7 fields, approx 20 million documents. Indexing speed is not
critical. At runtime the index is just about to fit in RAM.
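
Something like the following is what I have in mind; a sketch against the
Lucene 1.4 Similarity API, assuming extending Similarity directly and
installing it on both the writer and the searcher (I realize flattening
the arithmetic may not, by itself, stop norms or freqs from being read):

import org.apache.lucene.search.Similarity;

// Every scoring factor returns a constant, so all matching documents
// score identically: effectively a pure boolean model.
public class ConstantSimilarity extends Similarity {
    public float lengthNorm(String fieldName, int numTokens) { return 1.0f; }
    public float queryNorm(float sumOfSquaredWeights) { return 1.0f; }
    public float tf(float freq) { return 1.0f; }
    public float sloppyFreq(int distance) { return 1.0f; }
    public float idf(int docFreq, int numDocs) { return 1.0f; }
    public float coord(int overlap, int maxOverlap) { return 1.0f; }
}

// writer.setSimilarity(new ConstantSimilarity());
// searcher.setSimilarity(new ConstantSimilarity());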


thanks in advance, eksdev








Re: Lucene scoring bounds ??

2005-06-20 Thread Erik Hatcher


On Jun 18, 2005, at 7:39 PM, Paul Libbrecht wrote:
I read the Lucene book about scoring and read a bit of the javadoc,
but I can't seem to find the expected bounds for the score value.
I had believed the score would end up between 0 and 1, but I seem to
keep getting values under 0.2. It may be due to my particular queries,
but how can I be sure of this?


Hits from all non-HitCollector searches are "normalized".  Normalized
in this sense means that if the top-scoring document scores higher
than 1.0, it is normalized to 1.0, and that ratio is used to normalize
all scores.  However, if the top-scoring document is under 1.0, the
scores are left as-is.


Searches using a HitCollector are always left as-is.

Have a look at IndexSearcher.explain() results for document/query
combinations to see what is causing the lower-than-expected scores.
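
For example, something along these lines (a sketch; the "title" field is
just for illustration):

Hits hits = searcher.search(query);
for (int i = 0; i < hits.length(); i++) {
    // score(i) is the possibly-normalized score; explain() breaks down
    // how the raw score for that document was computed.
    System.out.println(hits.score(i) + " : " + hits.doc(i).get("title"));
    System.out.println(searcher.explain(query, hits.id(i)));
}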


Erik





Re: md5 keyword field issue

2005-06-20 Thread Erik Hatcher


On Jun 19, 2005, at 5:17 AM, [EMAIL PROTECTED] wrote:


Hi there,

i have an index with the following infos in it:
url - keyword - Field("url", this.url, Field.Store.YES, Field.Index.UN_TOKENIZED);
md5 - keyword - Field("md5", this.md5, Field.Store.YES, Field.Index.UN_TOKENIZED);

alt - Field("alt", this.alt, Field.Store.YES, Field.Index.TOKENIZED);

I use it to index my images.
Now it happens that the same image (e.g. same md5) is used in different
locations (e.g. different URLs):
filename mylogo.gif is used in
http://site.com/project1/mylogo.gif and also in
http://site.com/project2/some_other_bubu/mylogo.gif

The ALT is different in each location (e.g. different text).

Now in my image search app, when I search for mylogo I get "several"
results with the same image.

I would like to reduce the number of results so that the md5 is
unique.
Note: I can't delete the 2nd image from the index because the ALT might
be different, so in general all the properties put together (md5, url,
alt) compose a different "entity".



It seems you have conflicting goals here.  You want (md5, url, alt)  
to be unique in one sense, yet you want md5 itself to be unique in  
another sense.



i bought "Lucene in Action" book, which is a GREAT book.


Thank you!  :)


i was looking into "filters".

i quote: "If all the information needed to perform filtering is in the
index, there is no need to write your own filter because QueryFilter
can handle it."

I can't seem to figure out how QueryFilter can help me.

I also tried to write my own filter, but there is not much info in that
direction either.


Filters reduce the search space to a subset of the documents in the  
index.  Which document would you want returned when there are  
multiple documents in the index with the same MD5?  Or do you want to  
cluster them by MD5?


Do you want to cluster them by MD5 perhaps, but still return multiple  
documents back from a search?


I'm not sure if a Filter is the appropriate technique for this  
scenario or not.


Erik





Re[2]: md5 keyword field issue

2005-06-20 Thread catalin-lucene
Monday, June 20, 2005, 3:55:36 PM, Erik Hatcher wrote:
> Filters reduce the search space to a subset of the documents in the
> index.  Which document would you want returned when there are  
> multiple documents in the index with the same MD5?  Or do you want to
> cluster them by MD5?

I think clustering by md5 is more appropriate.

> Do you want to cluster them by MD5 perhaps, but still return multiple
> documents back from a search?

I want to return just the 1st image (the most relevant one). There is no
use showing duplicates in an image search app.

> I'm not sure if a Filter is the appropriate technique for this  
> scenario or not.

well, I am not sure either.
One solution would be, when I iterate through the hits collection and
send them to the webapp, to group them by md5 or something like that.

Is this a good way to do it?
(The bad thing is I would have to do lots of hits.doc(index) calls in
advance to build this group-by-md5, and if the results are
paginated << which is the case >>, on the 2nd page I would need to
keep the last "index" in the session or recalculate it again... oh
no! :)

in SQL this would be:
select distinct md5, url, alt from table group by md5 order by score asc;

if I had the score in the DB (which is not the case).

-- 
Catalin Constantin





Implicit Stopping in StandardTokenizer??

2005-06-20 Thread Max Pfingsthorn
Hi!

I've been trying to make an Analyzer which works like the StandardAnalyzer but 
without stopping. For some reason, though, I still don't get words like "is" or 
"a" out of it. I checked with Luke (one doc in one index with the contents 
"hello,this,is,a,keyword,hello!,nicetomeetyou"). This should tokenize into 
"hello this is a keyword hello nicetomeetyou", but it actually produces "hello 
keyword hello nicetomeetyou". Does anyone know why it drops those extra terms?

Best regards,

Max Pfingsthorn

Hippo  

Oosteinde 11
1017WT Amsterdam
The Netherlands
Tel  +31 (0)20 5224466
-
[EMAIL PROTECTED] / www.hippo.nl
--




Re: Re[2]: md5 keyword field issue

2005-06-20 Thread Erik Hatcher


On Jun 20, 2005, at 9:38 AM, [EMAIL PROTECTED] wrote:


Monday, June 20, 2005, 3:55:36 PM, Erik Hatcher wrote:


Filters reduce the search space to a subset of the documents in the
index.  Which document would you want returned when there are
multiple documents in the index with the same MD5?  Or do you want to
cluster them by MD5?



i think cluster by md5 is more appropriate.



Do you want to cluster them by MD5 perhaps, but still return multiple
documents back from a search?



I want to return just the 1st image (the most relevant one). There is no
use showing duplicates in an image search app.


Now you've just said the same conflicting thing a different way.  You  
want to cluster but only return one.  :)


If you only want one image returned, then it seems that only indexing  
the same image once is the way to go.  When you find a duplicate MD5,  
don't index that as a second document.  You will, instead, update the  
document by adding additional ALT text and perhaps the additional URL.


Is there a reason why indexing each unique image (by MD5) is not a  
good way to go in your case?



in sql this would be:
select distinct md5, url, alt from table group by md5 order by  
score asc;


This would give you multiple records for the same MD5.  You said  
above you only want one per MD5.


Erik





Re: IndexHTML.java location

2005-06-20 Thread Erik Hatcher

Brian,

The Lucene demo web application is a basic and woefully under-achieving
example of Lucene.  My recommendation is to dig under the covers a bit
deeper and tweak the code to suit your needs, or simply borrow enough
pieces to learn the API usage.  There is very little actual Lucene-using
code under the covers of the demo, and the bulk of most Lucene-using
projects is in the code and interface specific to the application rather
than in interacting with Lucene.


Erik


On Jun 17, 2005, at 11:14 AM, Brian wrote:


Right now, I'm just using the compiled version that I
downloaded.
By "default URL", I mean the location of the indexed
files.

I have the sample web project (index.jsp) on server A,
and my indexed files are on server B. Everything works
till I click on the link in the results.jsp page. It
looks one directory higher than where the web project
is. I was hoping to have it look at server B in the
appropriate directory.

Thanks, B

--- "Hondros, Constantine"
<[EMAIL PROTECTED]> wrote:



IndexHTML.java is located in the demo jarfile, which
is named differently depending on whether you built it
yourself with Ant or just downloaded it ready-jarred.

What do you mean by "default URL"? The location of
the document-set to
index, the location of the document-set to search
(presumably through
index.jsp in the samples), or the URL of the lucene
application as seen
through Tomcat?

-Original Message-
From: Brian [mailto:[EMAIL PROTECTED]
Sent: 17 June 2005 16:48
To: java-user@lucene.apache.org
Subject: IndexHTML.java location


Not sure if this is the right address to request this
kind of help, so if it isn't please point me elsewhere.

Basically I think I have an understanding of how
Lucene works, in general.

I believe I'm at a point where I need to change the
"default" URL, so I was planning to make the change in
the IndexHTML.java file. However, I don't know where
that file is located. I've already done the simple
things: a search of my HD, renaming the files, etc.
That hasn't helped. Any pointers would be appreciated.
Thanks, B




























Re: Implicit Stopping in StandardTokenizer??

2005-06-20 Thread Mike Barry
Max Pfingsthorn wrote:

>Hi!
>
>I've been trying to make an Analyzer which works like the StandardAnalyzer but 
>without stopping. For some reason, though, I still don't get words like "is" or 
>"a" out of it. I checked with Luke (one doc in one index with the contents 
>"hello,this,is,a,keyword,hello!,nicetomeetyou"). This should tokenize into 
>"hello this is a keyword hello nicetomeetyou", but it actually produces "hello 
>keyword hello nicetomeetyou". Does anyone know why it drops those extra terms?
>
>Best regards,
>
>Max Pfingsthorn
>
>Hippo  
>  
>
StandardAnalyzer has a constructor which allows you to pass in your own
array of stop words, so an array with zero elements should do the
trick:


String[] stopWords= new String[0];
StandardAnalyzer analyzer = new StandardAnalyzer(stopWords);

-MikeB.




Re[4]: md5 keyword field issue

2005-06-20 Thread catalin-lucene
Monday, June 20, 2005, 5:48:30 PM, Erik Hatcher wrote:
> Now you've just said the same conflicting thing a different way.  You
> want to cluster but only return one.  :)

I think I misunderstood the term "cluster" here.
So yes, I just want one image returned.

> If you only want one image returned, then it seems that only indexing
> the same image once is the way to go.  When you find a duplicate MD5,
> don't index that as a second document.  You will, instead, update the
> document by adding additional ALT text and perhaps the additional URL.

this sounds pretty OK!

> Is there a reason why indexing each unique image (by MD5) is not a  
> good way to go in your case?

>> in sql this would be:
>> select distinct md5, url, alt from table group by md5 order by  
>> score asc;

> This would give you multiple records for the same MD5.  You said  
> above you only want one per MD5.

here I'm afraid you are not correct, because I have a GROUP BY md5
clause, which will return no duplicates.

(I tested it on mysql.)
For the query above:
170 rows in set (0.13 sec)

select distinct md5 from image;
+----------------------------------+
| e127d0e91af5d8b2522138fb46c2e1bc |
| 7a18b029925d8357599878a85fd6b02f |
| ...                              |
+----------------------------------+
170 rows in set (0.00 sec)

same number of rows :D




-- 
Catalin Constantin






Re: Implicit Stopping in StandardTokenizer??

2005-06-20 Thread Erik Hatcher


On Jun 20, 2005, at 10:41 AM, Max Pfingsthorn wrote:


Hi!

I've been trying to make an Analyzer which works like the
StandardAnalyzer but without stopping. For some reason, though, I
still don't get words like "is" or "a" out of it. I checked with
Luke (one doc in one index with the contents
"hello,this,is,a,keyword,hello!,nicetomeetyou"). This should
tokenize into "hello this is a keyword hello nicetomeetyou", but
it actually produces "hello keyword hello nicetomeetyou". Does anyone
know why it drops those extra terms?


Show us the code of your analyzer.

If all you want is StandardAnalyzer to not remove stop words, you can  
construct it this way:


analyzer = new StandardAnalyzer(new String[] {});

The String[] contains the stop words to remove, in this case none.

Erik





RE: Implicit Stopping in StandardTokenizer??

2005-06-20 Thread Max Pfingsthorn
Hi!

Here is the code:


import java.io.Reader;

import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;

public class SimpleStandardAnalyzer extends Analyzer {
  public SimpleStandardAnalyzer() {
  }

  public TokenStream tokenStream(String fieldName, Reader reader) {
return new LowerCaseFilter(
  new StandardFilter(
new StandardTokenizer(reader)
  )
);
  }
}

I need it to be easily dynamically loadable via Class.forName() because we use it 
in an XML-configured environment (i.e. Avalon-like), so passing extra arguments to 
constructors is not really what I am looking for. However, I guess I could make 
a wrapper like this:

public class SimpleStandardAnalyzer extends StandardAnalyzer {

  public SimpleStandardAnalyzer()
  {
super(new String[0]);
  }
}

I tried this too, but still the same effect: "this" and "is", etc., get filtered 
out even with no stopwords set. Any ideas?

Thanks a lot!
max


-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Monday, June 20, 2005 16:57
To: java-user@lucene.apache.org
Subject: Re: Implicit Stopping in StandardTokenizer??



On Jun 20, 2005, at 10:41 AM, Max Pfingsthorn wrote:

> Hi!
>
> I've been trying to make an Analyzer which works like the  
> StandardAnalyzer but without stopping. For some reason though, I  
> still don't get words like "is" or "a" out of it... I checked with  
> Luke (one doc in one index with the contents  
> "hello,this,is,a,keyword,hello!,nicetomeetyou". This should  
> tokenize into "hello this is a keyword hello nicetomeetyou", but  
> actually it does "hello keyword hello nicetomeetyou". Does anyone  
> know why it drops those extra terms?

Show us the code of your analyzer.

If all you want is StandardAnalyzer to not remove stop words, you can  
construct it this way:

 analyzer = new StandardAnalyzer(new String[] {});

The String[] contains the stop words to remove, in this case none.

 Erik







Re: Re[4]: md5 keyword field issue

2005-06-20 Thread Erik Hatcher


On Jun 20, 2005, at 10:54 AM, [EMAIL PROTECTED] wrote:

Monday, June 20, 2005, 5:48:30 PM, Erik Hatcher wrote:


Now you've just said the same conflicting thing a different way.  You
want to cluster but only return one.  :)



I think I misunderstood the term "cluster" here.
So yes, I just want one image returned.


Maybe my interpretation of "cluster" is clouded by the search  
domain.  In the search domain, cluster means grouping multiple things.



If you only want one image returned, then it seems that only indexing
the same image once is the way to go.  When you find a duplicate MD5,
don't index that as a second document.  You will, instead, update the
document by adding additional ALT text and perhaps the additional  
URL.




this sounds pretty OK!


The tricks are to do a search when indexing to find duplicates, and
to "update" the document by deleting and re-adding it (you'll
probably want to store the field data so you can retrieve it easily
and use it for the new updated document).
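
A sketch of that delete-then-re-add step, assuming an indexed keyword
"md5" field (indexPath, analyzer, md5 and the merged document stand in
for whatever your application already has):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

// Deletes go through an IndexReader; no IndexWriter may have the
// index open at the same time.
IndexReader reader = IndexReader.open(indexPath);
reader.delete(new Term("md5", md5));
reader.close();

// Re-add a single merged document carrying all the ALT texts and URLs.
IndexWriter writer = new IndexWriter(indexPath, analyzer, false);
writer.addDocument(mergedDoc);
writer.close();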


The negative to this approach is that you won't know specifically which
page the image was on in the results, though you could keep all URLs
that point to it, as a document can have multiple fields named "url",
for example.



in sql this would be:
select distinct md5, url, alt from table group by md5 order by
score asc;





This would give you multiple records for the same MD5.  You said
above you only want one per MD5.



here I'm afraid you are not correct, because I have a GROUP BY md5
clause, which will return no duplicates.


Sorry, I missed the GROUP BY clause there in my first human parse of
the expression; I was too busy focusing on DISTINCT.


Erik





Re: Implicit Stopping in StandardTokenizer??

2005-06-20 Thread Erik Hatcher


On Jun 20, 2005, at 11:48 AM, Max Pfingsthorn wrote:

import java.io.Reader;

import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;

public class SimpleStandardAnalyzer extends Analyzer {
  public SimpleStandardAnalyzer() {
  }

  public TokenStream tokenStream(String fieldName, Reader reader) {
return new LowerCaseFilter(
  new StandardFilter(
new StandardTokenizer(reader)
  )
);
  }
}


That looks fine.  Stop words will not (well, "should not", it appears
to you!) be removed by that analyzer.


I tried this too, but still the same effect. "This" and "is", etc,  
get filtered out even with no stopwords set. Any ideas?


The only idea I have is that you're actually running StandardAnalyzer, and
not this custom one, through something amiss in your indirect configuration.


For example, I typed "this is" into the AnalyzerDemo that ships with
the Lucene in Action source code (get it from
http://www.lucenebook.com) after modifying AnalyzerDemo to construct a
StandardAnalyzer like this:


new StandardAnalyzer(new String[] {})

And got this output:

$ ant AnalyzerDemo

AnalyzerDemo:
[input] String to analyze: [This string will be analyzed.]
this is
 [echo] Running lia.analysis.AnalyzerDemo...
 [java] Analyzing "this is"
 [java]   WhitespaceAnalyzer:
 [java] [this] [is]

 [java]   SimpleAnalyzer:
 [java] [this] [is]

 [java]   StopAnalyzer:
 [java]

 [java]   StandardAnalyzer:
 [java] [this] [is]

 [java]   SnowballAnalyzer:
 [java] [this] [is]

 [java]   SnowballAnalyzer:
 [java] [this] [is]

 [java]   SnowballAnalyzer:
 [java] [thi] [i]

As you can see [this] and [is] made it fine through StandardAnalyzer.

Erik





Re: About the field of PhraseQuery

2005-06-20 Thread Paul Elschot
On Monday 20 June 2005 08:57, Paul Libbrecht wrote:
> So why is there no such constructor
>   PhraseQuery(String fieldName)
> and a method
>add(Token tok)
> ??

Tradition?

> That would be much more intuitive I feel!

Regards,
Paul Elschot

> 
> paul
> 
> 
> On 18 Jun 05, at 09:44, Paul Elschot wrote:
> 
> > It will throw an IllegalArgumentException when a Term is added
> > with a different field, which is probably what happened.
> >
> > For PhraseQuery the field name and the term text could have been
> > used separately in the interface, which might have prevented your bug.
> > For example an alternative PhraseQuery could have constructor with a
> > field name argument, and term texts could be added at phrase positions.
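
For reference, the interface as it stands looks like this (a sketch; adding
a Term with a different field is what throws the IllegalArgumentException
mentioned above):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;

PhraseQuery query = new PhraseQuery();
query.add(new Term("contents", "exact"));
query.add(new Term("contents", "phrase"));
// query.add(new Term("title", "oops"));  // different field: IllegalArgumentException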





Configuration Strategies

2005-06-20 Thread Yousef Ourabi
Hello:
I have a couple of questions on configuration strategies. I have a
project where I have to deal with changing search requirements; for
example, one search may want to use term vectors to find "keywords like
this" or whatever, and the next may not.

Another requirement is that when I call a "shutdown" method, the current
settings are saved to an XML file, so that the next time the main
SearchFacade class is started it re-reads this file and picks up where
it left off.

How have other lucene users dealt with this? Thanks for any input.

Best,
Yousef




Re: Configuration Strategies

2005-06-20 Thread Erik Hatcher


On Jun 20, 2005, at 3:36 PM, Yousef Ourabi wrote:


Hello:
I have a couple of questions on configuration strategies. I have a
project where I have to deal with changing search requirements; for
example, one search may want to use term vectors to find "keywords like
this" or whatever, and the next may not.

Another requirement is that when I call a "shutdown" method, the current
settings are saved to an XML file, so that the next time the main
SearchFacade class is started it re-reads this file and picks up where
it left off.

How have other lucene users dealt with this? Thanks for any input.


You might find some useful tidbits in the "I Love Lucene" case study
on TheServerSide, which was written for "Lucene in Action":




The XML configuration file concepts discussed there are along the lines of
what you're after, I think.
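
One simple possibility is sketched below with the standard
java.beans.XMLEncoder; the SearchSettings bean and its single property
are hypothetical placeholders for your real settings:

import java.beans.XMLDecoder;
import java.beans.XMLEncoder;
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class SearchSettings {

    private boolean useTermVectors = true;

    // Bean accessors; XMLEncoder persists any property exposed this way.
    public boolean isUseTermVectors() { return useTermVectors; }
    public void setUseTermVectors(boolean b) { useTermVectors = b; }

    // Called from your "shutdown" hook.
    public void save(String file) throws IOException {
        XMLEncoder out = new XMLEncoder(
            new BufferedOutputStream(new FileOutputStream(file)));
        out.writeObject(this);
        out.close();
    }

    // Called when the SearchFacade starts up again.
    public static SearchSettings load(String file) throws IOException {
        XMLDecoder in = new XMLDecoder(
            new BufferedInputStream(new FileInputStream(file)));
        SearchSettings settings = (SearchSettings) in.readObject();
        in.close();
        return settings;
    }
}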


Erik





Strange Error Problem

2005-06-20 Thread Youngho Cho
Hello,

I am developing our system with Lucene 1.4.3 and RemoteSearchable.
Currently I get the following error message, but I don't know why it
happens or how to fix it.

Could you explain what situation produces the following error?

Thanks.

Youngho

java.io.IOException: read past EOF
    at org.apache.lucene.store.InputStream.refill(InputStream.java:154)
    at org.apache.lucene.store.InputStream.readByte(InputStream.java:43)
    at org.apache.lucene.store.InputStream.readBytes(InputStream.java:57)
    at org.apache.lucene.index.SegmentReader.norms(SegmentReader.java:356)
    at org.apache.lucene.index.MultiReader.norms(MultiReader.java:159)
    at org.apache.lucene.search.TermQuery$TermWeight.scorer(TermQuery.java:64)
    at org.apache.lucene.search.BooleanQuery$BooleanWeight.scorer(BooleanQuery.java:165)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:85)
    at org.apache.lucene.search.RemoteSearchable.search(RemoteSearchable.java:60)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:324)
    at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:261)
    at sun.rmi.transport.Transport$1.run(Transport.java:148)
    at java.security.AccessController.doPrivileged(Native Method)
    at sun.rmi.transport.Transport.serviceCall(Transport.java:144)
    at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:460)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:701)
    at java.lang.Thread.run(Thread.java:534)
    at sun.rmi.transport.StreamRemoteCall.exceptionReceivedFromServer(StreamRemoteCall.java:247)
    at sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:223)
    at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:133)
    at com.nannet.fulcrum.lucene.util.RefinedRemoteSearchable_Stub.search(Unknown Source)
    at org.apache.lucene.search.MultiSearcherThread.run(ParallelMultiSearcher.java:251)

RE: Problems with concurrency using Tomcat

2005-06-20 Thread JMA

I will try that, thanks.

-Original Message-
From: Fred Vos [mailto:[EMAIL PROTECTED]
Sent: Monday, June 20, 2005 3:16 AM
To: java-user@lucene.apache.org
Subject: Re: Problems with concurrency using Tomcat


On Sun, Jun 19, 2005 at 07:05:06PM -0400, Joseph MarkAnthony wrote:
>
> I have been struggling with this problem for months and have made no
> progress.
> I have created a simple web app for my Lucene indices that allows the
> user to rebuild or update the index.
> In the *same* web app (perhaps this is important), I have the search
> module.
>
> The update module works fine - until someone tries a search.  Then the
> update will no longer be allowed to access the index directory.  The
> error is something like "cannot delete _a.cfs" or something similar.
>
> I'm running Tomcat on Win2000, and I've pretty much confirmed that it's
> a WINDOWS locking problem, not a Lucene problem per se.  This is the
> windows "Cannot delete, File may be in use" situation that you often
> get when someone else is using a file and you try to delete it.  You
> see this situation many times outside of Lucene.
>
> Does anyone know how to solve this?  It seems to be a Tomcat problem
> because I cannot duplicate the error in a standalone Java app.  I've
> seen a few posts on this and the resolution is often to copy the index
> and then update, then copy it back.  This can't be the only way to fix
> this.

Have you tried adding the -DdisableLuceneLocks=true flag to the RUNJAVA
command in catalina.sh? I had to add it to run servlets accessing Lucene,
though I'm not sure exactly what my problem was.

Lines 219-227 of my catalina.sh (running Jakarta Tomcat 5.0.16 under Linux):

  else
"$_RUNJAVA" $JAVA_OPTS $CATALINA_OPTS \
  -Djava.endorsed.dirs="$JAVA_ENDORSED_DIRS" -classpath "$CLASSPATH" \
  -Dcatalina.base="$CATALINA_BASE" \
  -Dcatalina.home="$CATALINA_HOME" \
  -Djava.io.tmpdir="$CATALINA_TMPDIR" \
  -DdisableLuceneLocks=true \
  org.apache.catalina.startup.Bootstrap "$@" start \
  >> "$CATALINA_BASE"/logs/catalina.out 2>&1 &

Fred


