Questions about GermanAnalyzer/Stemmer

2005-03-01 Thread Jon Humble
Hello,
 
We’re using the GermanAnalyzer/Stemmer to index/search our (German)
Website.
I have a few questions:
 
(1) Why is the GermanAnalyzer case-sensitive? None of the other
language indexers seem to be. What does this feature add?
(2) With the German Analyzer, wildcard searches containing extended
German characters do not seem to work. So, a* is fine but anä* or ö*
always find zero results. 
(3) In a similar vein to (2), wildcard searches with escaped special
characters fail to find results. So a search for co\-operative works but
a search for co\-op* fails.
 
I will be grateful for any light that can be shed on these problems.
 
With Thanks,
 
Jon.
 
Jon Humble
BSc (Hons)
Software Engineer
eMail: [EMAIL PROTECTED]

TecSphere Ltd
Centre for Advanced Industry
Coble Dene, Royal Quays
Newcastle upon Tyne NE29 6DE
United Kingdom
 
Direct Dial: +44 (191) 270 31 06
Fax: +44 (191) 270 31 09
http://www.tecsphere.com
 
 


Is IndexSearcher thread safe?

2005-03-01 Thread Volodymyr Bychkoviak
Is it thread-safe to share one
instance of IndexSearcher between multiple threads?


Re: Questions about GermanAnalyzer/Stemmer [auf Viren geprueft]

2005-03-01 Thread Jonathan O'Connor
Jon,
I too found some problems with the German analyser recently. Here's what
may help:
1. You can try reading Joerg Caumanns' paper "A Fast and Simple Stemming
Algorithm for German Words". This paper describes the algorithm
implemented by GermanAnalyser.
2. I guess it's because German nouns are all capitalized, so maybe that's why.
Although you would want to be indexing well-written German and not emails or
text messages!
3. The German Stemmer converts umlauts into some funny form (the code is a
bit tricky, and I didn't spend any time looking at it), so maybe that's why
you can't find umlauts properly. I think the main reason for this umlaut
change is that many plurals are formed by umlauting, e.g. Haus, Haeuser
(that ae is an umlaut).

Finally, to really understand what's happening, get your hands on Luke. I
just got it last week, and it's brilliant. It shows you everything about
your indexes. You can also feed text to an Analyser and see what it makes
of it. This will show you the real reason why your umlaut search is
failing.
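
For example, here is a minimal sketch (assuming the 1.4-era
TokenStream.next()/termText() API; the field name "contents" is arbitrary)
that prints what GermanAnalyzer makes of some text:

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.de.GermanAnalyzer;

public class ShowTokens {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new GermanAnalyzer();
        // run some sample text through the analyzer and print each token
        TokenStream stream =
            analyzer.tokenStream("contents", new StringReader("Haus Häuser"));
        for (Token token = stream.next(); token != null; token = stream.next()) {
            System.out.println(token.termText());
        }
    }
}

Comparing that output with the terms in your wildcard queries should make it
clear whether the umlaut substitution is what breaks them.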
Ciao,
Jonathan O'Connor
XCOM Dublin






Re: Questions about GermanAnalyzer/Stemmer [auf Viren geprueft]

2005-03-01 Thread Erik Hatcher
I had to moderate both Jonathan and Jon's messages in to the list.
Please subscribe to the list and post to it from the address you've
subscribed with. I cannot always guarantee I'll catch moderation messages
and send them through in a timely fashion.

Erik


Re: Custom filters & document numbers

2005-03-01 Thread tomsdepot-lucene
I'm also interested in knowing what can change the doc numbers.

Does this happen frequently?  Like Stanislav has been asking... what sort of
operations on the index cause the document number to change for any given
document?  If the document numbers change frequently, is there a
straightforward way to modify Lucene to keep the document numbers the same for
the life of the document?  I'd like to have mappings in my SQL database that
point to the document numbers that Lucene search returns in its Hits objects.

Thanks,

-Tom-

--- Stanislav Jordanov [EMAIL PROTECTED] wrote:

 The first statement is clear to me:
 I know that an IndexReader sees a 'snapshot' of the document set that was
 taken at the moment of the Reader's creation.
 
 What I don't know is whether this 'snapshot' also has its doc numbers fixed,
 or whether they may change asynchronously.
 And another thing I don't know is which index operations may
 cause the (doc -> doc number) mapping to change.
 Is it only after a delete, or are there other occasions, or had I better
 not count on this at all?
 
 StJ
 
 - Original Message - 
 From: Vanlerberghe, Luc [EMAIL PROTECTED]
 To: Lucene Users List lucene-user@jakarta.apache.org
 Sent: Thursday, February 24, 2005 4:07 PM
 Subject: RE: Custom filters & document numbers
 
 
  An IndexReader will always see the same set of documents.
  Even if another process deletes some documents, adds new ones or
  optimizes the complete index, your IndexReader instance will not see
  those changes.
 
  If you detect that the Lucene index changed (e.g. by calling
  IndexReader.getCurrentVersion(...) once in a while), you should close
  and reopen your 'current' IndexReader and recalculate any data that
  relies on the Lucene document numbers.
 
  Regards, Luc.
 
  -Original Message-
  From: Stanislav Jordanov [mailto:[EMAIL PROTECTED]
  Sent: donderdag 24 februari 2005 14:18
  To: Lucene Users List
  Subject: Custom filters & document numbers
 
  Given an IndexReader, a custom filter is supposed to create a bit set
  that maps each document number to {'visible', 'invisible'}. On the other
  hand, it is stated that Lucene is allowed to change document numbers.
  Is it guaranteed that this BitSet's view of document numbers won't
  change while the BitSet is still in use (or perhaps while the corresponding
  IndexReader is still open)?
 
  And another (more low-level) question:
  When may Lucene change document numbers?
  Is it only when the index is optimized after there has been a delete
  operation?
 
  Regards: StJ
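
A minimal sketch of the reopen-on-version-change approach Luc describes
above (assuming the 1.4-era static IndexReader.getCurrentVersion(Directory)):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.Directory;

public class ReaderHolder {
    private final Directory dir;
    private IndexReader reader;
    private long version;

    public ReaderHolder(Directory dir) throws IOException {
        this.dir = dir;
        this.reader = IndexReader.open(dir);
        this.version = IndexReader.getCurrentVersion(dir);
    }

    // Call periodically: reopens the reader when the index version changes.
    // Doc numbers (and any BitSets built from them) are only stable for the
    // lifetime of one reader, so recompute filters after a reopen.
    public synchronized IndexReader current() throws IOException {
        long latest = IndexReader.getCurrentVersion(dir);
        if (latest != version) {
            reader.close();
            reader = IndexReader.open(dir);
            version = latest;
        }
        return reader;
    }
}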
 
 



Re[2]: Is IndexSearcher thread safe?

2005-03-01 Thread Yura Smolsky
Hello, Volodymyr.

VB Additional question:
VB If I'm sharing one instance of IndexSearcher between different threads,
VB is it good to just drop this instance to the GC?
VB Because I don't know if some thread is still using this searcher or is
VB done with it.

It is safe to share one instance between many threads, and it should be
safe to drop the old object to the GC.

But I have discovered one strange fact. When you have an IndexSearcher on a
big index, the IndexSearcher object takes a lot of memory (900 MB), and
when you create a new IndexSearcher after deleting all references to the
old IndexSearcher, the memory consumed by the old IndexSearcher is never
freed.
What can the community say about this strange fact?

Yura Smolsky.






Re: Questions about GermanAnalyzer/Stemmer [auf Viren geprueft]

2005-03-01 Thread Jonathan O'Connor
Apologies Erik,
This must be one of those apostrophe-in-email-address problems I always
get. Recently I removed the apostrophe from the email address I give out.
Our server recognizes both email addresses, but some of these mailing lists
don't like the O'Connor clann!
Ciao,
Jonathan O'Connor
XCOM Dublin




RE: help with boolean expression

2005-03-01 Thread Omar Didi
I found something kind of weird about the way Lucene interprets boolean
expressions without parentheses.
When I run the query A AND B OR C, it returns only the documents that have A (in
other words, as if the query were just the term A).
When I run the query A OR B AND C, it returns only the documents that have B
AND C (as if the query were just B AND C). I set the default operator in my
application to be AND.
Can anyone explain this behavior? Thanks.

-Original Message-
From: Morus Walter [mailto:[EMAIL PROTECTED]
Sent: Monday, February 28, 2005 2:40 AM
To: Lucene Users List
Subject: Re: help with boolean expression


Omar Didi writes:
 I have a problem understanding how lucene would interpret this boolean
 expression: A AND B OR C.
 It neither returns the same count as when I enter (A AND B) OR C nor
 A AND (B OR C).
 If anyone knows how it is interpreted I would be thankful.
 Thanks

A AND B OR C creates a query that requires A and B. C influences the
score, but is neither sufficient nor required for a match.

IMO the query parser is broken for queries mixing AND and OR without explicit
parentheses.
My favorite example is `a AND b OR c AND d', which equals `a AND b AND c AND d'
in the query parser.

I suggested a patch some time ago, but it's still pending in bugzilla.
http://issues.apache.org/bugzilla/show_bug.cgi?id=25820

Don't know if it's still usable with current sources.
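
Until then, one way to sidestep the ambiguity is to build the query
programmatically with explicit clauses. A sketch using the 1.4-era
add(Query, required, prohibited) signature; the field and term names are
placeholders:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

// Build (A AND B) OR C explicitly instead of trusting the parser.
BooleanQuery ab = new BooleanQuery();
ab.add(new TermQuery(new Term("field", "a")), true, false);     // required
ab.add(new TermQuery(new Term("field", "b")), true, false);     // required

BooleanQuery query = new BooleanQuery();
query.add(ab, false, false);                                    // optional
query.add(new TermQuery(new Term("field", "c")), false, false); // optional

With only optional clauses at the top level, a document matches if either
side matches, which is the (A AND B) OR C reading.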

Morus




Remove document fails

2005-03-01 Thread Alex Kiselevski

Hi,
I have a problem doing IndexReader.delete(int doc);
it fails with a lock error.



Alex Kiselevski

+9.729.776.4346 (desk)
+9.729.776.1504 (fax)

AMDOCS  INTEGRATED CUSTOMER MANAGEMENT





RE: Is IndexSearcher thread safe?

2005-03-01 Thread Cocula Remi


Additional question:
If I'm sharing one instance of IndexSearcher between different threads,
is it good to just drop this instance to the GC?
Because I don't know if some thread is still using this searcher or is done
with it.

Note that as long as one of the threads keeps a reference to the IndexSearcher,
it cannot be garbage collected.
Perhaps you meant that you do not know how a thread can declare that it no
longer needs the IndexSearcher.

To cope with this, I created an IndexSearcher pool.
The pool contains a list of IndexSearchers, and each one is associated with a
counter.
To get an IndexSearcher reference one must request it from the pool, and then
the counter is incremented.
(To make it cleaner I had the idea to replace the IndexSearcher references in
the pool with proxy objects; thus the pool will never distribute references to
IndexSearchers to client objects.
The counter can be managed inside the proxy.)

The pool has the ability to close and dereference an IndexSearcher when it is
no longer used (counter == 0).
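
A bare-bones sketch of the counter idea (without the proxies), assuming
callers pair every acquire() with a release():

import java.io.IOException;
import org.apache.lucene.search.IndexSearcher;

// One slot of the pool: an IndexSearcher plus a reference counter.
public class SearcherSlot {
    private final IndexSearcher searcher;
    private int refCount = 0;
    private boolean retired = false;  // set when a newer searcher replaces this one

    public SearcherSlot(IndexSearcher searcher) {
        this.searcher = searcher;
    }

    public synchronized IndexSearcher acquire() {
        refCount++;
        return searcher;
    }

    public synchronized void release() throws IOException {
        refCount--;
        if (retired && refCount == 0) {
            searcher.close();  // last user is gone: safe to close
        }
    }

    public synchronized void retire() throws IOException {
        retired = true;
        if (refCount == 0) {
            searcher.close();
        }
    }
}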

Hope it helps.






RE: Re[2]: Is IndexSearcher thread safe?

2005-03-01 Thread Cocula Remi

I probably had the same trouble (but I'm not sure).
I ran a test program that was creating a lot of IndexSearchers (but also
closing and freeing them).
It ended with an OutOfMemory exception.
But I'm not finished with that problem (I need to use a profiler).








Re: Remove document fails

2005-03-01 Thread Volodymyr Bychkoviak
Maybe you have an open IndexWriter at the same time you are trying to
delete the document.
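
If that is the cause, the fix is to release the write lock before deleting.
A rough sketch using the 1.4-era IndexReader.delete(int); "writer", the
index path and "docNum" are placeholders:

import org.apache.lucene.index.IndexReader;

// Only one object may modify the index at a time:
writer.close();                    // release the write lock first

IndexReader reader = IndexReader.open("/path/to/index");
reader.delete(docNum);             // now the delete can proceed
reader.close();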




Zip Files

2005-03-01 Thread Luke Shannon
Hello;

Anyone have any ideas on how to index the contents within zip files?

Thanks,

Luke





Re: Zip Files

2005-03-01 Thread Ernesto De Santis
Hello,
first, you need a parser for each file type: PDF, TXT, Word, etc.
Then use the Java API to iterate over the zip content; see:
http://java.sun.com/j2se/1.4.2/docs/api/java/util/zip/ZipInputStream.html
Use the getNextEntry() method.
Little example:
ZipInputStream zis = new ZipInputStream(fileInputStream);
ZipEntry zipEntry;
while ((zipEntry = zis.getNextEntry()) != null) {
    // use zipEntry to get the name, etc.
    // pick the proper parser for the current entry
    // use that parser with zis (the ZipInputStream)
}
good luck
Ernesto

--
Ernesto De Santis - Colaborativa.net
Córdoba 1147 Piso 6 Oficinas 3 y 4
(S2000AWO) Rosario, SF, Argentina.



Large Index managing

2005-03-01 Thread Volodymyr Bychkoviak
Hi,
just an idea for how to manage a large index that is updated very often.
Very often there is a need to update a document in the index. To update a
document in the index you must delete the old document and then add the
new one. In most cases this requires you to open an IndexReader, delete the
document, close the IndexReader, create an IndexWriter, add the document,
close the IndexWriter, and re-open the IndexSearcher (if the index is
searched heavily). Profiling some applications, I found that most of the
time is spent in the IndexReader.open() method. It also produces many
objects, so it adds GC overhead.

The idea to optimize this process is to create two indexes: one main index
that can be very large, and a second index that serves as a change
buffer. We can keep one IndexReader open for the first index (and use
it for searching and for deleting old documents). The second index is small,
and we can reopen its IndexReader frequently when needed.

When the second index reaches some number of documents, we can merge it with
the main index.
To search this multi-index we could use a MultiSearcher over the two
indexes, but with a little trick: the first IndexSearcher is kept the same
the whole time, until the second index is merged with the main one, and the
second IndexSearcher is reopened whenever the second index changes.
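
A sketch of the trick (untested, like the idea itself; assumes the 1.4-era
MultiSearcher and placeholder index paths):

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Searchable;
import org.apache.lucene.search.Searcher;

// mainSearcher stays open until the buffer is merged into the main index;
// only the small buffer searcher is reopened after each change.
IndexSearcher mainSearcher = new IndexSearcher("/index/main");
IndexSearcher bufferSearcher = new IndexSearcher("/index/buffer");
Searcher searcher = new MultiSearcher(
        new Searchable[] { mainSearcher, bufferSearcher });

// ... after the buffer index changes:
bufferSearcher.close();
bufferSearcher = new IndexSearcher("/index/buffer");
searcher = new MultiSearcher(
        new Searchable[] { mainSearcher, bufferSearcher });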

It is just an idea (it is not tested).
Will it help to improve the speed of updating a large index and lower the
memory overhead?
Any comments?

Regards,
Volodymyr Bychkoviak



Re: Zip Files

2005-03-01 Thread Luke Shannon
Thanks Ernesto.

The issue I'm working with now (this is more lack of experience than
anything) is getting an input I can index. All my indexing classes (doc,
pdf, xml, ppt) take a File object as a parameter and return a Lucene
Document containing all the fields I need.

I'm struggling with how I can work with an array of bytes instead of a
Java File.

It would be easier to unzip the zip to a temp directory, parse the files and
then delete the directory. But this would greatly slow indexing and use up
disk space.

Luke







Re: Fast access to a random page of the search results.

2005-03-01 Thread Doug Cutting
Stanislav Jordanov wrote:
startTs = System.currentTimeMillis();
dummyMethod(hits.doc(nHits - nHits));
stopTs = System.currentTimeMillis();
System.out.println("Last doc accessed in " + (stopTs - startTs) + "ms");
'nHits - nHits' always equals zero.  So you're actually printing the 
first document, not the last.  The last document would be accessed with 
'hits.doc(nHits - 1)'.  Accessing the last document should not be much 
slower (or faster) than accessing the first.

200+ milliseconds to access a document does seem slow.  Where is your 
index stored?  On a local hard drive?

Doug


Re: Zip Files

2005-03-01 Thread Chris Lamprecht
Luke,

Look at the javadocs for java.io.ByteArrayInputStream - it wraps a
byte array and makes it accessible as an InputStream.  Also see
java.util.zip.ZipFile.  You should be able to read and parse all
contents of the zip file in memory.

http://java.sun.com/j2se/1.4.2/docs/api/java/io/ByteArrayInputStream.html
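
Putting the two together, a sketch that reads each entry fully into memory
and hands it to a parser as an InputStream (the parse dispatch at the end
is a hypothetical method of your own):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

ZipFile zip = new ZipFile("/path/to/archive.zip");
for (Enumeration entries = zip.entries(); entries.hasMoreElements();) {
    ZipEntry entry = (ZipEntry) entries.nextElement();
    if (entry.isDirectory()) continue;

    // copy the entry into a byte array...
    InputStream in = zip.getInputStream(entry);
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    byte[] chunk = new byte[4096];
    for (int n; (n = in.read(chunk)) != -1;) {
        buffer.write(chunk, 0, n);
    }
    in.close();

    // ...then wrap it as an InputStream for whichever parser fits the name
    InputStream forParser = new ByteArrayInputStream(buffer.toByteArray());
    // parse(entry.getName(), forParser);  // hypothetical dispatch
}
zip.close();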





Investigating Lucene For Project

2005-03-01 Thread Scott Purcell
I am looking for a solution to a problem I am having. We have a web-based asset 
management solution where we manage customers' assets.
 
We have had requests from some clients who would like the ability to index 
PDF files now, and possibly other text files in the future. The PDF files live 
on a server and are in a structured environment. I would like to somehow index 
the content inside the PDFs and be able to run searches on that information from 
a web form. The result MUST BE a text snippet (that being some text prior to 
the searched word and after the searched word). 
Does this make sense? And can Lucene do this?
 
If the product can do this, what is the best way to get rolling on a project of 
this nature? Purchase an example book, or are there simple examples one can 
pick up on? Does Lucene have a large learning curve, or is it reasonably quick?
 
If all the above will work, what kind of license does this require? I have not 
been able to find a link to that yet on the Jakarta site.
 
I sincerely appreciate any input into this.
 
Sincerely,
Scott
 


Re: Investigating Lucene For Project

2005-03-01 Thread Ben Litchfield

See inlined comments below.

 We have had requests from some clients who would like the ability to
 index  PDF files, now and possibly other text files in the future. The
 PDF files live on a server and are in a structured environment. I would
 like to somehow index the content inside the PDF and be able to run
 searches on that information from a web-form. The result MUST BE a text
 snippet (that being some text prior to the searched word and after the
 searched word).  Does this make sense? And can Lucene do this?


Lucene indexes text documents, so you will need to convert your PDF to a
text document.  PDFBox (http://www.pdfbox.org/) can do that.  PDFBox
provides a summary of the document, which is just the first x
characters.  If you wanted a smarter summary you would need to create that
yourself.
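
For the conversion step, something along these lines works (a sketch
against the current org.apache.pdfbox package names; the releases current
in 2005 used the org.pdfbox package instead):

import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Extract the plain text of a PDF so it can be indexed as a Lucene field.
PDDocument pdf = PDDocument.load(new File("/path/to/file.pdf"));
try {
    String text = new PDFTextStripper().getText(pdf);
    // add "text" to a Lucene Document here
} finally {
    pdf.close();
}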

 If the product can do this, how is the best way to get rolling on a
 project of this nature? Purchase an example book, or are there simple
 examples one can pick up on? Does Lucene have a large learning curve? or
 reasonably quick?

There are tutorials available on the website, and I would recommend
the Lucene in Action book.  There is a learning curve for lucene, but it
sounds like your requirements are pretty basic so it shouldn't be that
hard.



 If all the above will work, what kind of license does this require? I
 have not been able to find a link to that yet on the jakarta site.

http://www.apache.org/licenses/LICENSE-2.0

Ben




Best Practices for Distributing Lucene Indexing and Searching

2005-03-01 Thread Luke Francl
Lucene Users,

We have a requirement for a new version of our software that it run in a
clustered environment. Any node should be able to go down but the
application must keep functioning.

Currently, we use Lucene on a single node but this won't meet our fail
over requirements. If we can't find a solution, we'll have to stop using
Lucene and switch to something else, like full text indexing inside the
database.

So I'm looking for best practices on distributing Lucene indexing and
searching. I'd like to hear from those of you using Lucene in a
multi-process environment what is working for you. I've done some
research, and based on on what I've seen so far, here's a bit of
brainstorming on what seems to be possible:

1. Don't. Have a single indexing and searching node. [Note: this is the
last resort.]

2. Don't distribute indexing. Searching is distributed by storing the
index on NFS. A single indexing node would process all requests.
However, using Lucene on NFS is *not* recommended. See:
http://lucenebook.com/search?query=nfs ...it can result in the stale NFS
file handle problem:
http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg12481.html
So we'd have to investigate this option. Indexing could use a JMS queue
so if the box goes down, when it comes back up, indexing could resume
where it left off.

3. Distribute indexing and searching into separate indexes for each
node. Combine results using ParallelMultiSearcher. If a box went down, a
piece of the index would be unavailable. Also, there would be serious
issues making sure assets are indexed in the right place to prevent
duplicates, stale results, or deleted assets from showing up in the
index. Another possibility would be a hashing scheme for
indexing...assets could be put into buckets based on their
IDs to prevent duplication. Keeping results consistent as you're
changing the number of buckets as the nodes come up and down would
be a challenge, though.

4. Distribute indexing and searching, but index everything at each node.
Each node would have a complete copy of the index. Indexing would be
slower. We could move to a 5 or 15 minute batch approach.

5. Index centrally and push updated indexes to search nodes on a
periodic basis. This would be easy and might avoid the problems with
using NFS.

6. Index locally and synchronize changes periodically. This is an
interesting idea and bears looking into. Lucene can combine multiple
indexes into a single one, which can be written out somewhere else, and
then distributed back to the search nodes to replace their existing
index.

7. Create a JDBCDirectory implementation and let the database handle the
clustering. A JDBCDirectory exists
(http://ppinew.mnis.com/jdbcdirectory/), but has only been tested with
MySQL. It would probably require modification (the code is under the
LGPL). At one time, an OracleDirectory implementation existed but that
was in 2000 and so it is surely badly outdated. But in principle, the
concept is possible. However, these database-based directories are
slower at indexing and searching than the traditional style, probably
mostly due to BLOB handling.

8. Can the Berkeley DB-based DBDirectory help us? I am not sure what
advantages it would bring over the traditional FSDirectory, but maybe
someone else has some ideas.

Please let me know if you've got any other ideas or a best practice to
follow.

Thanks,
Luke Francl





Re: Fast access to a random page of the search results.

2005-03-01 Thread Doug Cutting
Daniel Naber wrote:
After fixing this I can reproduce the problem with a local index that 
contains about 220.000 documents (700MB). Fetching the first document 
takes for example 30ms, fetching the last one takes 100ms. Of course I 
tested this with a query that returns many results (about 50.000). 
Actually it happens even with the default sorting, no need to sort by some 
specific field.
In part this is due to the fact that Hits first searches for the 
top-scoring 100 documents.  Then, if you ask for a hit after that, it 
must re-query.  In part this is also due to the fact that maintaining a 
queue of the top 50k hits is more expensive than maintaining a queue of 
the top 100 hits, so the second query is slower.  And in part this could 
be caused by other things, such as that the highest-ranking document 
might tend to be cached and not require disk I/O.

One could perform profiling to determine which is the largest factor. 
Of these, only the first is really fixable: if you know you'll need hit 
50k then you could tell this to Hits and have it perform only a single 
query.  But the algorithmic cost of keeping the queue of the top 50k is 
the same as collecting all the hits and sorting them.  So, in part, 
getting hits 49,990 through 50,000 is inherently slower than getting 
hits 0-10.  We can minimize that, but not eliminate it.
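
If you know up front that you need hits that deep, you can skip Hits and
issue the single query yourself. A sketch against the lower-level 1.4 API
("searcher" and "query" are assumed to exist already):

import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

// Ask for the top 50,000 results in a single pass instead of letting
// Hits re-query as you page deeper.
TopDocs topDocs = searcher.search(query, null, 50000);
ScoreDoc[] scoreDocs = topDocs.scoreDocs;
if (scoreDocs.length > 0) {
    ScoreDoc last = scoreDocs[scoreDocs.length - 1];
    System.out.println("last doc id: " + last.doc + ", score: " + last.score);
}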

Doug


Multiple indexes

2005-03-01 Thread Ben
Hi

My site has two types of documents with different structure. I would
like to create an index for each type of document. What is the best
way to implement this?

I have been trying to implement this but found out that 90% of the
code is the same.

In Lucene in Action book, there is a case study on jGuru, it just
mentions them using multiple indexes. I would like to do something
like them.

Any resources on the Internet that I can learn from?

Thanks,
Ben




How to manipulate the lucene index table

2005-03-01 Thread Srimant Mishra
Hi all,

I have a web-based application that we use to index text documents
as well as images; the indexed fields are either Field.Unstored or
Field.Keyword.

Currently, we plan to modify some of the index field names. For
example, if the index field name was DOCLOCALE, we plan to break it up into
two fields: DOCUMENTTYPE and LOCALE. Since the index files that Lucene
creates have become quite big (close to 1 gig), we are looking for a way to
be able to read the index entries and modify them via a standalone Java
program.

Does Lucene provide any APIs to read these index entries and update them? Is
there an easy way to do it?

Thanks in advance,
Srimant



Re: Multiple indexes

2005-03-01 Thread Erik Hatcher
It's hard to answer such a general question with anything very precise, 
so sorry if this doesn't hit the mark.  Come back with more details and 
we'll gladly assist though.

First, certainly do not copy/paste code.  Use standard reuse practices, 
perhaps the same program can build the two different indexes if passed 
different parameters, or share code between two different programs as a 
JAR.

What specifically are the issues you're encountering?
Erik


Re: Best Practices for Distributing Lucene Indexing and Searching

2005-03-01 Thread Yonik Seeley
 6. Index locally and synchronize changes periodically. This is an
 interesting idea and bears looking into. Lucene can combine multiple
 indexes into a single one, which can be written out somewhere else, and
 then distributed back to the search nodes to replace their existing
 index.

This is a promising idea for handling a high update volume because it
avoids all of the search nodes having to do the analysis phase.

Unfortunately, the way addIndexes() is implemented looks like it's
going to present some new problems:

  public synchronized void addIndexes(Directory[] dirs)
      throws IOException {
    optimize();   // start with zero or 1 seg
    for (int i = 0; i < dirs.length; i++) {
      SegmentInfos sis = new SegmentInfos();  // read infos from dir
      sis.read(dirs[i]);
      for (int j = 0; j < sis.size(); j++) {
        segmentInfos.addElement(sis.info(j)); // add each info
      }
    }
    optimize();   // final cleanup
  }

We need to deal with some very large indexes (40G+), and an optimize
rewrites the entire index, no matter how few documents were added. 
Since our strategy calls for deleting some docs on the primary index
before calling addIndexes() this means *both* calls to optimize() will
end up rewriting the entire index!

The ideal behavior would be that of addDocument() - segments are only
merged occasionally.   That said, I'll throw out a replacement
implementation that probably doesn't work, but hopefully will spur
someone with more knowledge of Lucene internals to take a look at
this.

  public synchronized void addIndexes(Directory[] dirs)
      throws IOException {
    // REMOVED: optimize();
    for (int i = 0; i < dirs.length; i++) {
      SegmentInfos sis = new SegmentInfos();  // read infos from dir
      sis.read(dirs[i]);
      for (int j = 0; j < sis.size(); j++) {
        segmentInfos.addElement(sis.info(j)); // add each info
      }
    }
    maybeMergeSegments();   // replaces optimize
  }

-Yonik




Re: Multiple indexes

2005-03-01 Thread Ben
Is it true that for each index I have to create a separate instance
of FSDirectory, IndexWriter and IndexReader? Do I need to create a
separate locking mechanism as well?

I have already implemented a program using just one index.

Thanks,
Ben




Re: Best Practices for Distributing Lucene Indexing and Searching

2005-03-01 Thread Doug Cutting
Yonik Seeley wrote:
6. Index locally and synchronize changes periodically. This is an
interesting idea and bears looking into. Lucene can combine multiple
indexes into a single one, which can be written out somewhere else, and
then distributed back to the search nodes to replace their existing
index.
This is a promising idea for handling a high update volume because it
avoids all of the search nodes having to do the analysis phase.
A clever way to do this is to take advantage of Lucene's index file 
structure.  Indexes are directories of files.  As the index changes 
through additions and deletions most files in the index stay the same. 
So you can efficiently synchronize multiple copies of an index by only 
copying the files that change.

The way I did this for Technorati was to:
1. On the index master, periodically checkpoint the index.  Every minute 
or so the IndexWriter is closed and a 'cp -lr index index.DATE' command 
is executed from Java, where DATE is the current date and time.  This 
efficiently makes a copy of the index when its in a consistent state by 
constructing a tree of hard links.  If Lucene re-writes any files (e.g., 
the segments file) a new inode is created and the copy is unchanged.

2. From a crontab on each search slave, periodically poll for new 
checkpoints.  When a new index.DATE is found, use 'cp -lr index 
index.DATE' to prepare a copy, then use 'rsync -W --delete 
master:index.DATE index.DATE' to get the incremental index changes. 
Then atomically install the updated index with a symbolic link (ln -fsn 
index.DATE index).

3. In Java on the slave, re-open 'index' when its version changes. 
This is best done in a separate thread that periodically checks the 
index version.  When it changes, the new version is opened, and a few 
typical queries are performed on it to pre-load Lucene's caches.  Then, 
in a synchronized block, the Searcher variable used in production is 
updated (see the sketch after this list).

4. In a crontab on the master, periodically remove the oldest checkpoint 
indexes.

Technorati's Lucene index is updated this way every minute.  A 
mergeFactor of 2 is used on the master in order to minimize the number 
of segments in production.  The master has a hot spare.
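
A sketch of the step-3 reopen thread (version polling via the 1.4-era
IndexReader.getCurrentVersion; the names are illustrative):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;

public class SearcherRefresher extends Thread {
    private final String indexPath;   // the 'index' symlink
    private IndexSearcher searcher;
    private long version;

    public SearcherRefresher(String indexPath) throws java.io.IOException {
        this.indexPath = indexPath;
        this.searcher = new IndexSearcher(indexPath);
        this.version = IndexReader.getCurrentVersion(indexPath);
    }

    public synchronized IndexSearcher getSearcher() {
        return searcher;
    }

    public void run() {
        while (true) {
            try {
                Thread.sleep(60 * 1000);  // poll once a minute
                long latest = IndexReader.getCurrentVersion(indexPath);
                if (latest != version) {
                    IndexSearcher fresh = new IndexSearcher(indexPath);
                    // warm the new searcher with a few typical queries here
                    synchronized (this) {
                        // production code should close the old searcher
                        // once no request is still using it
                        searcher = fresh;
                        version = latest;
                    }
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}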

Doug


Re: Multiple indexes

2005-03-01 Thread Otis Gospodnetic
Ben,

You do need to use a separate instance of those 3 classes for each
index, yes.  But this is really something like:

IndexWriter writer = new IndexWriter(...);

So it's the normal code-writing process; you don't really have to create
anything new, just use the existing Lucene API.  As for locking, again you
don't need to create anything.  Lucene does have a locking mechanism,
but most of it should be completely invisible to you if you follow the
concurrency rules.
I hope this helps.

Otis




list moving to lucene.apache.org

2005-03-01 Thread Roy T . Fielding
This list is about to be moved to java-user at lucene.apache.org.
Please excuse the temporary inconvenience.
Cheers,
Roy T. Fielding, co-founder, The Apache Software Foundation
 ([EMAIL PROTECTED])  http://roy.gbiv.com/