Re: some thoughts about adding transactions.

2005-01-11 Thread Scott Ganyo
I didn't want to let this drop on the floor, but I haven't had the 
time to craft a response to it either.  So, just for the record: I agree 
that transactions would be nice.  I think it is important that the 
solution address change visibility and concurrent transactions within 
multiple VMs.  Also, it should be backward compatible so that 
applications can run without transactions.  So, I think a good 
solution is probably more complex than it initially looks...

S
On Jan 8, 2005, at 6:47 AM, Peter Veentjer - Anchor Men wrote:
I have a question about transactions.
Lucene doesn't support transactions, but I find them very important, and I 
think it is possible to add some kind of rollback/commit functionality 
to make sure the index doesn't become corrupted.

With Lucene every segment is immutable (this is a perfect starting 
point), so after it has been created it remains forever in a valid 
state. There are three ways to alter the index:
1) deleting documents
2) adding documents
3) optimization

If I delete a document, a .del file appears (the segment itself is not 
altered, because it is immutable).
- if crash: the .del files could be deleted to do a rollback.
- if success: the .del files will eventually be used by the writer to skip 
those documents in the new segment.

If a new document is added, a new segment is (eventually) created.
- if success: the new segment is created and the old segments can be 
deleted.
- if crash: the new segment (which may be corrupted) can be deleted to do 
a rollback.

If the index is optimized, a new segment is created based on the older 
segments.
- if success: the old segments can be deleted.
- if crash: the new segment (which may be corrupted) can be deleted to do 
a rollback.

Given this information, wouldn't it be relatively little trouble to add some 
kind of rollback/transaction functionality?

And how about those 'per index' files? Can these be corrupted? Can 
they be removed and recreated successfully? Would it be an idea to 
make copies of these files and restore them if the transaction is 
rolled back?
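
A minimal sketch of this rollback idea in plain Java: snapshot the file names
in the index directory before a batch of updates, and on failure delete
anything created since. It assumes a failed batch only leaves behind newly
created files (new segments, .del files) and never touches existing ones,
which Lucene does not strictly guarantee, so treat it as illustration only.

  import java.io.File;
  import java.util.Arrays;
  import java.util.HashSet;
  import java.util.Set;

  public class IndexRollback {

      private final File indexDir;
      private Set before;   // file names present when the "transaction" started

      public IndexRollback(File indexDir) {
          this.indexDir = indexDir;
      }

      // Remember which files exist before the batch of updates.
      public void begin() {
          before = new HashSet(Arrays.asList(indexDir.list()));
      }

      // On failure, delete anything created since begin() (new segments, .del files).
      public void rollback() {
          String[] now = indexDir.list();
          for (int i = 0; i < now.length; i++) {
              if (!before.contains(now[i])) {
                  new File(indexDir, now[i]).delete();
              }
          }
      }
  }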

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: dotLucene (port of Jakarta Lucene to C#)

2004-12-01 Thread Scott Ganyo
Why does it seem to you that C# is faster than Java?
In any case, generally the bottleneck isn't the VM.  It's the I/O to 
the disks...

Scott
The reasonable man adapts himself to the world; the unreasonable one 
persists in trying to adapt the world to himself. Therefore all 
progress depends on the unreasonable man. - George Bernard Shaw

On Dec 1, 2004, at 5:42 AM, Nicolas Maisonneuve wrote:
Hi George,
is the C# Lucene faster than the Java Lucene?  (Because it seems to me
that C# is faster than Java, isn't it?)
Nicolas Maisonneuve

On Sun, 28 Nov 2004 21:08:30 -0500, George Aroush [EMAIL PROTECTED] 
wrote:
Hi folks,
I am pleased to announce the availability of dotLucene 1.4.0 RC1.  
dotLucene
is a complete port of Jakarta Lucene to C#.  The port is almost a
line-by-line port and it includes the demos as well as all the JUnit 
tests.
An index created by dotLucene is cross-compatible with Jakarta Lucene, 
and
vice versa.

Please visit http://sourceforge.net/projects/dotlucene/ to learn more 
about
dotLucene and to download the source code.

Best regards,
-- George Aroush
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





Re: BooleanQuery - Too Many Clauses on date range.

2004-10-01 Thread Scott Ganyo
You can use:
BooleanQuery.setMaxClauseCount(int maxClauseCount);
to increase the limit.
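For example (a minimal fragment; the value chosen here is arbitrary and simply
trades heap for the ability to expand large ranges):

  import org.apache.lucene.search.BooleanQuery;

  // The default cap of 1024 exists to guard against OutOfMemoryError, so only
  // raise it as far as your heap allows; Integer.MAX_VALUE removes the cap.
  BooleanQuery.setMaxClauseCount(8 * 1024);

Another option worth considering is filtering on the date range instead of
letting it expand into boolean clauses at all.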
On Sep 30, 2004, at 8:24 PM, Chris Fraschetti wrote:
I recently read, in regard to my problem, that date_field:[0820483200
TO 110448]
is evaluated into a series of boolean queries ... which has a cap of
1024 ... considering my documents will have dates spanning many
years, and I need the granularity of 'by day' searching, are there
any recommendations on how to make this work?
Currently with query: +content_field:sometext +date_field:[0820483200
TO 110448]
I get the following exception:
org.apache.lucene.search.BooleanQuery$TooManyClauses
Any suggestions on how I can still keep the granularity of by-day searching, but
without limiting my search results? Are there any date formats that I
could change those numbers to that would allow me to complete the search
(i.e. Feb 15, 2004)? Can Lucene's range query do a proper search on
formatted dates?
Is there a combination of RangeQuery and Query/MultiTermQuery that I 
can use?

your help is greatly appreciated.
--
___
Chris Fraschetti
e [EMAIL PROTECTED]
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Open-ended range queries

2004-06-10 Thread Scott ganyo
At one point it definitely supported null for either term.  I think 
that has been removed/forgotten in the later revisions of the 
QueryParser...

Scott
On Jun 10, 2004, at 1:24 PM, Erik Hatcher wrote:
On Jun 10, 2004, at 2:13 PM, Terry Steichen wrote:
Actually, QueryParser does support open-ended ranges like [term TO 
null].
It doesn't work for the lower end of the range (though that's usually 
less of a
problem).
It supports null?  Are you sure?  If so, I'm very confused about it 
because I don't see where in the grammar it has any special handling 
like that.  Could you show an example that demonstrates this?

Erik

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: Open-ended range queries

2004-06-10 Thread Scott ganyo
It looks to me like Revision 1.18 broke it.
On Jun 10, 2004, at 3:26 PM, Erik Hatcher wrote:
On Jun 10, 2004, at 4:07 PM, Terry Steichen wrote:
Well, I'm using 1.4 RC3 and the null range upper limit works just 
fine for
searches in two of my fields; one is in the form of a canonical date 
(e.g.,
20040610) and the other is in the form of a padded word count (e.g., 
01500
for 1500).  The syntax would be pub_date:[20040501 TO null] (dates 
later
than April 30, 2004) and s_words:[01000 TO null] (articles with 1000 
or more
words).
Ah... it works for you because you have numeric values and, lexically, 
null is greater than any of them.  It is still using null as a lexical 
term value, and not truly making the end open-ended.

This is why null doesn't work at the beginning for you either.  It's 
just being treated as text, just like your numbers are.

PS: This use of null has worked this way since at least 1.2.  As I 
recall,
way back when, null also worked as the first term limit (but no 
longer
does).
If so, then something serious broke.  I've not the time to check the 
cvs logs on this, but I cannot imagine that we removed something like 
this.  If anyone cares to dig up the diff where we removed/broke this, 
I'd be gracious.

Erik
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: Open-ended range queries

2004-06-10 Thread Scott ganyo
Well, I do like the *, but apparently there are some people that are  
using this with the null...

Scott
On Jun 10, 2004, at 7:15 PM, Erik Hatcher wrote:
On Jun 10, 2004, at 4:54 PM, Scott ganyo wrote:
It looks to me like Revision 1.18 broke it.
It seems this could be it:
revision 1.18
date: 2002/06/25 00:05:31;  author: briangoetz;  state: Exp;  lines:  
+62 -33
Support for new range query syntax.  The delimiter is  TO , but is  
optional
for backward compatibility with previous syntax.  If the range  
arguments
match the format supported by  
DateFormat.getDateInstance(DateFormat.SHORT),
then they will be converted into the appropriate date strings a la  
DateField.

Added Field.Keyword constructor for Date-valued arguments.
Optimized DateField.timeToString function.
But geez... June 2002, and no one has complained since?
Given that this is so outdated, I'm not sure what the right course of  
action is.  There are lots more Lucene users now than there were then.  
Would adding NULL back be what folks want?  What about simply an  
asterisk to denote open-endedness?  [* TO term] or [term TO *]

For completeness, here is the diff:
% cvs diff -u -r 1.17 -r 1.18 QueryParser.jj
Index: QueryParser.jj
===
RCS file:  
/home/cvs/jakarta-lucene/src/java/org/apache/lucene/queryParser/ 
QueryParser.jj,v
retrieving revision 1.17
retrieving revision 1.18
diff -u -r1.17 -r1.18
--- QueryParser.jj  20 May 2002 15:45:43 -  1.17
+++ QueryParser.jj  25 Jun 2002 00:05:31 -  1.18
@@ -65,8 +65,11 @@

 import java.util.Vector;
 import java.io.*;
+import java.text.*;
+import java.util.*;
 import org.apache.lucene.index.Term;
 import org.apache.lucene.analysis.*;
+import org.apache.lucene.document.*;
 import org.apache.lucene.search.*;
 /**
@@ -218,35 +221,30 @@
   private Query getRangeQuery(String field,
   Analyzer analyzer,
-  String queryText,
+  String part1,
+  String part2,
   boolean inclusive)
   {
-// Use the analyzer to get all the tokens.  There should be 1 or 2.
-TokenStream source = analyzer.tokenStream(field,
-  new StringReader(queryText));
-Term[] terms = new Term[2];
-org.apache.lucene.analysis.Token t;
+boolean isDate = false, isNumber = false;

-for (int i = 0; i < 2; i++)
-{
-  try
-  {
-t = source.next();
-  }
-  catch (IOException e)
-  {
-t = null;
-  }
-  if (t != null)
-  {
-String text = t.termText();
-if (!text.equalsIgnoreCase("NULL"))
-{
-  terms[i] = new Term(field, text);
-}
-  }
+try {
+  DateFormat df = DateFormat.getDateInstance(DateFormat.SHORT);
+  df.setLenient(true);
+  Date d1 = df.parse(part1);
+  Date d2 = df.parse(part2);
+  part1 = DateField.dateToString(d1);
+  part2 = DateField.dateToString(d2);
+  isDate = true;
 }
-return new RangeQuery(terms[0], terms[1], inclusive);
+catch (Exception e) { }
+
+if (!isDate) {
+  // @@@ Add number support
+}
+
+return new RangeQuery(new Term(field, part1),
+  new Term(field, part2),
+  inclusive);
   }
   public static void main(String[] args) throws Exception {
@@ -282,7 +280,7 @@
 | #_WHITESPACE: (   | \t ) 
 }
-DEFAULT SKIP : {
+DEFAULT, RangeIn, RangeEx SKIP : {
   _WHITESPACE
 }
@@ -303,14 +301,28 @@
 | PREFIXTERM:  _TERM_START_CHAR (_TERM_CHAR)* * 
 | WILDTERM:  _TERM_START_CHAR
   (_TERM_CHAR | ( [ *, ? ] ))* 
-| RANGEIN:   [ ( ~[ ] ] )+ ]
-| RANGEEX:   { ( ~[ } ] )+ }
+| RANGEIN_START: [  : RangeIn
+| RANGEEX_START: {  : RangeEx
 }
 Boost TOKEN : {
 NUMBER:(_NUM_CHAR)+ ( . (_NUM_CHAR)+ )?  : DEFAULT
 }
+RangeIn TOKEN : {
+RANGEIN_TO: TO
+| RANGEIN_END: ] : DEFAULT
+| RANGEIN_QUOTED: \ (~[\])+ \
+| RANGEIN_GOOP: (~[  , ] ])+ 
+}
+
+RangeEx TOKEN : {
+RANGEEX_TO: TO
+| RANGEEX_END: } : DEFAULT
+| RANGEEX_QUOTED: \ (~[\])+ \
+| RANGEEX_GOOP: (~[  , } ])+ 
+}
+
 // *   Query  ::= ( Clause )*
 // *   Clause ::= [+, -] [TERM :] ( TERM | ( Query ) )
@@ -387,7 +399,7 @@
 Query Term(String field) : {
-  Token term, boost=null, slop=null;
+  Token term, boost=null, slop=null, goop1, goop2;
   boolean prefix = false;
   boolean wildcard = false;
   boolean fuzzy = false;
@@ -415,12 +427,29 @@
else
  q = getFieldQuery(field, analyzer, term.image);
  }
- | ( term=RANGEIN { rangein=true; } | term=RANGEEX )
+ | ( RANGEIN_START (  
goop1=RANGEIN_GOOP|goop1=RANGEIN_QUOTED )
+ [ RANGEIN_TO ] (  
goop2=RANGEIN_GOOP|goop2=RANGEIN_QUOTED )
+ RANGEIN_END )
+   [ CARAT boost=NUMBER ]
+{
+  if (goop1.kind == RANGEIN_QUOTED)
+goop1.image = goop1.image.substring(1,  
goop1

Re: DocumentWriter, StopFilter should use HashMap... (patch)

2004-03-11 Thread Scott ganyo
I don't buy it.  HashSet is but one implementation of a Set.  By 
choosing the HashSet implementation you are not only tying the class to 
a hash-based implementation, you are tying the interface to *that 
specific* hash-based implementation or its subclasses.  In the end, 
either you buy the concept of the interface and its abstraction or you 
don't.  I firmly believe in using interfaces as they were intended to 
be used.

Scott

P.S. In fact, HashSet isn't always going to be the most efficient 
anyway.  Just for one example:  Consider possible implementations if I 
have only 1 or 2 entries.

On Mar 10, 2004, at 11:13 PM, Erik Hatcher wrote:

On Mar 10, 2004, at 10:28 PM, Doug Cutting wrote:
Erik Hatcher wrote:
Also... your HashSet constructor has to copy values from the 
original HashSet into the new HashSet ... not very clean, and this 
can just be removed by forcing the caller to use a HashSet (which 
they should).
I've caved in and gone HashSet all the way.
Did you not see my message suggesting a way to both not expose 
HashSet publicly and also not to copy values?  If not, I attached it.
Yes, I saw it.  But is there a reason not to just expose HashSet given 
that it is the data structure that is most efficient?  I bought into 
Kevin's arguments that it made sense to just expose HashSet.

As for copying values - that is only happening now if you use the 
Hashtable or String[] constructor.

	Erik


Doug



From: Doug Cutting [EMAIL PROTECTED]
Date: March 10, 2004 1:08:24 PM EST
To: Lucene Developers List [EMAIL PROTECTED]
Subject: Re: cvs commit: 
jakarta-lucene/src/java/org/apache/lucene/analysis StopFilter.java
Reply-To: Lucene Developers List [EMAIL PROTECTED]

[EMAIL PROTECTED] wrote:
  -  public StopFilter(TokenStream in, Set stopTable) {
  +  public StopFilter(TokenStream in, Set stopWords) {
       super(in);
  -    table = stopTable;
  +    this.stopWords = new HashSet(stopWords);
     }
This always allocates a new HashSet, which, if the stop list is 
large, and documents are small, could impact performance.

Perhaps we can replace this with something like:

public StopFilter(TokenStream in, Set stopWords) {
  this(in, stopWords instanceof HashSet ? ((HashSet)stopWords)
                                        : new HashSet(stopWords));
}
and then add another constructor:

private StopFilter(TokenStream in, HashSet stopWords) {
  super(in);
  this.stopWords = stopWords;
}
Also, if we want the implementation to always be a HashSet 
internally, for performance, we ought to declare the field to be a 
HashSet, no?

The competing goals here are:
  1. Not to expose publicly the implementation of the Set;
  2. Not to copy the contents of the Set when folks pass the value of 
makeStopSet.
  3. Use the most efficient implementation internally.

I think the changes above meet all of these.

Doug

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: Index advice...

2004-02-10 Thread Scott ganyo
I have.  While document.add() itself doesn't increase over time, the 
merge does.  Ways of partially overcoming this include increasing the 
mergeFactor (but this will increase the number of file handles used), 
or building blocks of the index in memory and then merging them to 
disk.  This has been discussed before, so you should be able to find 
additional information on this fairly easily.

Scott
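
A rough sketch of the build-in-memory-then-merge approach using the 1.x-era API
(IndexWriter over a RAMDirectory, then addIndexes on the on-disk writer); the
class and method names here are made up for illustration:

  import java.io.IOException;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.RAMDirectory;

  public class BatchIndexer {
      // diskWriter is the long-lived writer over the on-disk index.
      public static void addBatch(IndexWriter diskWriter, Document[] batch)
              throws IOException {
          RAMDirectory ramDir = new RAMDirectory();
          IndexWriter ramWriter = new IndexWriter(ramDir, new StandardAnalyzer(), true);
          for (int i = 0; i < batch.length; i++) {
              ramWriter.addDocument(batch[i]);   // fast: everything stays in memory
          }
          ramWriter.close();
          // one larger merge to disk instead of many small ones
          diskWriter.addIndexes(new Directory[] { ramDir });
      }
  }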

On Feb 10, 2004, at 7:55 AM, Otis Gospodnetic wrote:

--- Leo Galambos [EMAIL PROTECTED] wrote:
Otis Gospodnetic wrote:

Without seeing more information/code, I can't tell which part of
your
system slows down with time, but I can tell you that Lucene's 'add'
does not slow over time (i.e. as the index gets larger).  Therefore,
I
would look elsewhere for causes of the slowdown.


Otis, can you point me to some proofs that time of insert operation

does not depend on the index size, please? Amortized time of insert
is O(log(docsIndexed/mergeFac)), I think.
This would imply that Lucene gets slower as it adds more documents to
the index.  Have you observed this behaviour?  I haven't.
Thus I do not know how it could be O(1).
~ O(1) is what I have observed through experiments with indexing of
several million documents.
Otis


AFAIK the issue with PDF files can be based on the PDF parser (I
already
encountered this with PDFbox).
The easiest thing to do is add logging to suspicious portions of the
code.  That will narrow the scope of the code you need to analyze.
Otis

--- [EMAIL PROTECTED] wrote:


Hey Lucene-users,

I'm setting up a Lucene index on 5G of PDF files (full-text
search).
I've
been really happy with Lucene so far but I'm curious what tips and
strategies
I can use to optimize my performance at this large size.
So far I am using pretty much all of the defaults (I'm new to
Lucene).
I am using PDFBox to add the documents to the index.
I can usually add about 800 or so PDF files and then the add loop:
for (int i = 0; i < fileNames.length; i++) {
    Document doc = IndexFile.index(baseDirectory + documentRoot + fileNames[i]);
    writer.addDocument(doc);
}
really starts to slow down.  Doesn't seem to be memory related.
Thoughts anyone?
Thanks in advance,
CK Hill



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail:
[EMAIL PROTECTED]





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]







-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]






Re: BooleanQuery question

2004-01-16 Thread Scott ganyo
No, a clause doesn't need to be required or prohibited, but it can't be both.  
Here is a rundown:

* A required clause will allow a document to be selected if and only if 
it contains that clause and will exclude any documents that don't.

* A prohibited clause will exclude any documents that contain that 
clause.

* A clause that is neither prohibited nor required will select a 
document if it contains the clause, but the clause will not prevent 
non-matching documents from being selected by other clauses.
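
A small example of the three cases, using the BooleanQuery.add(query, required,
prohibited) signature from the 1.x API (field and term values are made up):

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.BooleanQuery;
  import org.apache.lucene.search.TermQuery;

  BooleanQuery q = new BooleanQuery();
  q.add(new TermQuery(new Term("contents", "lucene")), true,  false); // required
  q.add(new TermQuery(new Term("contents", "python")), false, true);  // prohibited
  q.add(new TermQuery(new Term("contents", "search")), false, false); // optional: affects scoring only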

Hopefully that helps,

Scott

On Jan 16, 2004, at 7:32 AM, Thomas Scheffler wrote:

Karl Koch wrote:
Hi all,

why does the boolean query have a required and a prohibited field
(boolean
value)? If something is required it cannot be forbidden, and 
vice versa? How
does this match with the Boolean model we know from theory?
What if required and prohibited are both off? That's something we need.

Are there differences between Lucene and the Boolean model in theory?
To encode three states you need at least 2 bits. That's the
theory.

Kind regards

Thomas

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: java.io.IOException: Bad file number

2003-11-10 Thread Scott Ganyo
I don't think adding extensive locking is necessary.  What you are 
probably experiencing is that you've closed the index before you're done 
using it.  If you aren't careful to close the index only after all 
searches on it have been completed, you'll get an error like this.

Scott

[EMAIL PROTECTED] wrote:

Hello,

I'm trying to debug a problem with a lucene installation which is getting
java.io.IOException: Bad file number occasionally when performing
searches.  More specifically, the exception is coming when we are using a
Reader to extract the hit Documents from the index (due to using
getMessage instead of printStackTrace, I can't tell for sure if the
exception is coming from opening the reader, or getting the
document...arrgh!).
I believe this problem is because of our design, which is that we allow
ongoing multiple searches, and every 30 seconds we have a separate program
which performs updates (adding and deleting documents) on the same index
that is being searched.  After a batch of updates are performed we close
and re-open the IndexSearcher, the idea being that it should now be able to
access the new documents.
Is this a situation where we should have some locking in place that has
searches wait while documents are being added/deleted?  This would be easy
enough to implement, but there is a lot of updating to do, and we don't
want to sacrifice the excellent performance of the search by waiting every
30 seconds while updates happen.  We've thought of two basic paths to take:
1.  Implement a locking mechanism, and maybe try to add/delete one document
each time the updating program acquires the lock, instead of a bigger batch.
We think this might keep the search waiting the least amount of time, but
updates will take longer.
2.  Use a scheme with 2 indexes where we always update the one that isn't
being searched in, and switch between the two.  We are not sure if it makes
sense to perform the switch every 30 seconds in this case.
Does anyone have an idea if I am correct about the cause of the Exception,
or any thoughts on the two possible solutions?  We are running jdk 1.4, and
lucene 1.2 on solaris.
Thanks for any help you can give.

Brad Hendricks



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
 

--
Always drink upstream from the herd. - Will Rogers


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Multiple writers

2003-10-29 Thread Scott Ganyo
Offhand, I would say that using 2 directories and merging them is 
exactly what you want.  It really shouldn't be all that complicated, and 
Lucene should handle the synchronization for you...

Scott

Dror Matalon wrote:

Hi folks,

We're in the process of adding search to our online RSS aggregator. You
can see it in action at www.fastbuzz.com.
Currently we have more than five million items in the system and it's
growing at the rate of more than 100,00 a day.  So we need to take into
account is that the index is constantly growing.
One of the things we want to build into the system is the ability to
rebuild the index on the fly while still inserting the items that are
coming in. 

We've looked at having things go into different directories and then
merge them, but it seems complicated and we'd need to worry about race
conditions and locking issues.
Anyone's done this before? Any suggestions?

Regards,

Dror

 

--
...there is nothing more difficult to execute, nor more dubious of success, nor more 
dangerous to administer than to introduce a new order to things; for he who introduces 
it has all those who profit from the old order as his enemies; and he has only 
lukewarm allies in all those who might profit from the new. - Niccolo Machiavelli


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Limit on number of required/prohibited clauses

2003-09-05 Thread Scott Ganyo
Hi Eugene,

Yes.  Doug (Cutting) added this to eliminate OutOfMemory errors that 
apparently some people were having.  Unfortunately, it causes 
backward-compatibility issues if you were used to using version 1.2.  
So, you'll need to add a call like this:

BooleanQuery.setMaxClauseCount(Integer.MAX_VALUE);

(Of course, you can set the parameter to whatever you want, but 
unrestricted works best for me.)

Scott

Eugene S. wrote:

Hi,

I've come across the limit on the number of
required/prohibited clauses in a boolean query (the
limit  is 32). What is the reasoning for having such
limit? Can it be circumvented?
Thanks!

Eugene.

__
Do you Yahoo!?
Yahoo! SiteBuilder - Free, easy-to-use web site design software
http://sitebuilder.yahoo.com
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
 

--
All progress is initiated by challenging current conceptions, and executed by 
supplanting existing institutions. - George Bernard Shaw


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Reuse IndexSearcher?

2003-08-19 Thread Scott Ganyo
Yes.  You can (and should for best performance) reuse an IndexSearcher 
as long as you don't need access to changes made to the index.  An open 
IndexSearcher won't pick up changes to the index, so if you need to see 
the changes, you will need to open a new searcher at that point.

Scott
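
A minimal sketch of the reuse-and-reopen pattern (class and method names are
made up; note the caveat in the comment about closing the old searcher while
other threads may still be using it, which is exactly the trap described in the
"java.io.IOException: Bad file number" thread above):

  import java.io.IOException;
  import org.apache.lucene.search.IndexSearcher;

  public class SearcherHolder {
      private final String indexPath;
      private IndexSearcher searcher;

      public SearcherHolder(String indexPath) throws IOException {
          this.indexPath = indexPath;
          this.searcher = new IndexSearcher(indexPath);
      }

      // Share one searcher across threads; it sees the index as of when it was opened.
      public synchronized IndexSearcher get() {
          return searcher;
      }

      // Call after the index has changed and the changes must become visible.
      public synchronized void reopen() throws IOException {
          IndexSearcher old = searcher;
          searcher = new IndexSearcher(indexPath);
          old.close();   // unsafe if another thread is still searching with 'old'
      }
  }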

Aviran Mordo wrote:

Can I reuse one Instance of IndexSearcher to do multiple searches (in
multiple threads) or do I have to instantiate a new IndexSearcher for
each search?
 



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Make Lucene Index distributable

2003-08-18 Thread Scott Ganyo
Be careful with option 1.  NFS and the Lucene file-based locking 
mechanism don't get along very well.  (See the archives for details...)

Scott

Lienhard, Andrew wrote:

I can think of three options:

1) Single index dir on a shared drive (NFS, etc.) which is mounted on each
app server. 

2) Create copies of the index dir for each machine. Requires regular
updates, etc (not good if search data changes often).
3) Create a web service for search. Each app server makes an HTTP call to a
standalone Lucene app which returns some sort of XML-formatted search
result. 

I've taken approaches 1 and 3 (w/ Verity, but it would likely be the same w/
Lucene). 2 is really only good if you have relatively static data. For our
Lucene rollout here, we're going w/ option 1.
Andrew Lienhard
Web Technology Manager
United Media
200 Madison Avenue
New York, NY 10016
http://www.dilbert.com
http://www.snoopy.com
http://members.comics.com




-Original Message-
From: Uhl V., DP ITS, SCB, FD [mailto:[EMAIL PROTECTED] 
Sent: Monday, August 18, 2003 11:05 AM
To: '[EMAIL PROTECTED]'
Subject: Make Lucene Index distributable

Hello all,
We have developed our web app with Lucene under Tomcat 4.x and stored the index
in the file system. Now this web application has to move to a BEA WebLogic
cluster. My problem is to create a distributable Lucene index. Does anyone have
ideas or experience with how to do this? (How should the index be stored?)
Thanks for any ideas.

Kind regards
Vitali Uhl
Client Server Systeme
Deutsche Post ITSolutions GmbH 
tel. +49 (0) 661 / 921 -245 
fax: +49 (0) 661 / 921 -111
internet: http://www.dp-itsolutions.de/ http://www.dp-itsolutions.de/  
Address:
DP ITSolutions GmbH 
D - 36035 Fulda

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
 

--
All progress is initiated by challenging current conceptions, and executed by 
supplanting existing institutions. - George Bernard Shaw


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: NLucene up to date ?

2003-07-31 Thread Scott Ganyo
Do these implementations maintain file compatibility with the Java version?

Scott

Erik Hatcher wrote:

I'd love to see there be quality implementations of the Lucene API in 
other languages, that are up to date with the latest Java codebase.

I'm embarking on a Ruby port, which I'm hosting at rubyforge.org.  
There is a Python version called Lupy.

A related question I have is what about performance comparisons 
between the different language implementations?  Will Java be the 
fastest?  Is there a test suite already available that can demonstrate 
the performance characteristics of a particular implementation?  I'd 
love to see the numbers and see if even the Java version can be beat.

Erik



On Thursday, July 31, 2003, at 08:43  AM, 
[EMAIL PROTECTED] wrote:

Hi all,

http://sourceforge.net/projects/nlucene/ has a version numbered 1.2b2.
Does anyone know if this source is still being maintained to be 
closer to the java developments ?
Was this an external project to Apache Jakarta ?

I (we) have just successfully released a search engine using a C# 
implementation of Lucene.  The code had to be brought up to date in line 
with recent Java builds, and enhanced with additional features (e.g. 
field sorting, term position score factoring, etc.).

Any other c# users who would like to see NLucene kept in line with 
the java version ?

Maybe I'm just being lazy with having to maintain my own version of 
Lucene =).
Surely there are others out there who are c# users and follow the 
mailing lists (I remember a Brian somewhere !) but seldom post.

Brendon



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Luke - Lucene Index Browser

2003-07-14 Thread Scott Ganyo
Nifty cool!  I'm gonna like this, I can tell already!

I'm having a really hard time actually using Luke, though, as all the 
window panes and table columns are apparently of fixed size.  Do you 
think you could throw in the ability to resize the various window 
panes and table columns?  This would make the tool truly useful.  Pretty 
please? :)

Thanks,
Scott
Andrzej Bialecki wrote:

Dear Lucene Users,

Luke is a diagnostic tool for Lucene 
(http://jakarta.apache.org/lucene) indexes. It enables you to browse 
documents in existing indexes, perform queries, navigate through 
terms, optimize indexes and more.

Please go to http://www.getopt.org/luke and give it a try. A Java 
WebStart version will be available soon.



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Incremental indexing

2002-12-05 Thread Scott Ganyo
+1.  Support for transactions in Lucene is high on my list of desirable 
features as well.  I would love to have time to look into adding this, 
but lately... well, you know how that goes.

Scott

Eric Jain wrote:
If you want to update a set of documents, you can remove their previous
version first and then add them after that. In the meantime, documents
of this set are temporarily not available. If you have to update a single
document and make the changes immediately public, I don't know a better
solution than yours.



Thanks. I'm not so much worried about temporary inconsistencies as the index
is maintained separately. Of course it would be great if Lucene provided
direct support for some kind of transactional integrity! Anyways, removing
all changed documents first means I have to scan through all documents
twice, not very efficient, though in fact faster than the procedure I
described.


--
Eric Jain


--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]


--
Brain: Pinky, are you pondering what I’m pondering?
Pinky: I think so, Brain, but calling it a pu-pu platter? Huh, what were 
they thinking?


--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]



Re: How does delete work?

2002-11-22 Thread Scott Ganyo
It just marks the record as deleted.  The record isn't actually removed 
until the index is optimized.

Scott
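
For example, with the 1.x API (the path, field, and value here are made up),
marking the deletes and then physically removing them:

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.Term;

  // Mark matching documents as deleted; this only writes the .del file.
  IndexReader reader = IndexReader.open("/path/to/index");
  int marked = reader.delete(new Term("id", "42"));
  reader.close();

  // Physically remove the marked documents by rewriting the segments.
  IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), false);
  writer.optimize();
  writer.close();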

Rob Outar wrote:

Hello all,

	I used the delete(Term) method, then I looked at the index files; 
only one
file changed (_1tx.del).  I found references to the document still in some 
of the
index files, so my question is: how does Lucene handle deletes?

Thanks,

Rob


--
To unsubscribe, e-mail:
For additional commands, e-mail: 


--
Brain: Pinky, are you pondering what I’m pondering?
Pinky: I think so, Brain, but calling it a pu-pu platter? Huh, what were 
they thinking?


--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]



Re: Fun project?

2002-11-21 Thread Scott Ganyo
I'm rather partial to Jini for distributed systems, but I agree that 
JXTA would definitely be the way to go on this type of peer-to-peer 
scenario.

Scott

[EMAIL PROTECTED] wrote:

I'll be doing something very similar some time in the next 12 months for
the project I'm working on. I'll be more than happy to
contribute the code when it's done, but the rest of the project has been
implemented with CORBA, and it had been my plan to use CORBA for the
distributed index servers as well.

I'll look into JXTA though, as I hadn't come across it before.

Kiril





Otis Gospodnetic
21/11/2002 16:57
Please respond to Lucene Users List


To: Lucene Users List
cc:
Subject:Re: Fun project?


Yeah, I thought of that, too.  JXTA is the P2P piece that you are
asking about.  A recent post on Slashdot mentioned something that IBM
did that sounds similar.  Time... :)

Otis

--- Robert A. Decker  wrote:

I wish I had time to work on this for fun, but I was thinking about
what
could be a fun lucene project...

One could build a peer-to-peer document search application. Each
client
would index the documents on its harddrive, or documents in a
particular
directory. When the user at the computer does a search it will look
at the
documents on its harddrive, but also send out a request for the
search on
the P2P network.

First though, are there any P2P java frameworks out there? One could
build
one, perhaps with OpenJMS, but it would be nice if one already
existed.

Hmm... if anyone else thinks this would be cool I'd be willing to
work on
this with you.


thanks,
Robert A. Decker

http://www.robdecker.com/
http://www.planetside.com/



--
To unsubscribe, e-mail:

For additional commands, e-mail:




__
Do you Yahoo!?
Yahoo! Mail Plus ? Powerful. Affordable. Sign up now.
http://mailplus.yahoo.com

--
To unsubscribe, e-mail:
For additional commands, e-mail:





--
To unsubscribe, e-mail:
For additional commands, e-mail: 


--
Brain: Pinky, are you pondering what I’m pondering?
Pinky: I think so, Brain, but calling it a pu-pu platter? Huh, what were 
they thinking?


--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]



Re: Searching Ranges

2002-11-11 Thread Scott Ganyo
Hi Alex,

I just looked at this and had the following thought:

The RangeQuery must continue to iterate after the first match is found 
in order to match everything within the specified range.  In other 
words, if you have a range of "a" to "d", you can't stop with "a"; you 
need to continue to "d".  The point at which you move beyond "d" is where 
the query should stop iterating.  That is reflected in lines 
160-162.  It seems to me that your solution would only work where your 
range consists of a single term.

Please let me know if I'm just misunderstanding the situation.

Scott
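
For reference, this is roughly what the term enumeration looks like (simplified
from what RangeQuery does; reader is an open IndexReader, and stopping at the
first match would drop every other term that also falls inside the range):

  TermEnum terms = reader.terms(new Term(field, lowerText));   // first term >= lower bound
  try {
      do {
          Term t = terms.term();
          if (t == null || !field.equals(t.field()) || t.text().compareTo(upperText) > 0) {
              break;   // ran off the field or past the upper bound
          }
          // build a clause for t here; every term in [lower, upper] needs one
      } while (terms.next());
  } finally {
      terms.close();
  }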

Alex Winston wrote:

Thanks for the reply; my apologies for not explaining myself very
clearly, it has been a long day.

You expressed exactly our situation. Unfortunately this is not an option,
because we want to have multiple ranges for each document as well.
There is a possible extension of what you suggested, but that is a last
resort.  Kinda crazy, I know, but you have to meet requirements :).

but i also had a thought while i was looking through the lucene code,
and any comments are welcome.

I may be very mistaken because it has been a long day, but if you look at
the current CVS version of RangeQuery, it appears that even if a match is
found it will continue to iterate over terms within a field, and in my
case that is on the order of thousands.  If I add a break after a match
has been found, the search appears to improve by roughly an
order of magnitude on average (my math has left me, so I cannot be theoretical at
the moment).  I have unit tested the change on my side and on the Lucene
side and it works.  Note: one hard example is that a query went from 20
seconds to 0.5 seconds.  Any initial thoughts on whether there is a case where
this would not work?

beginning line 164:
TermQuery tq = new TermQuery(term);   // found a match
tq.setBoost(boost);                   // set the boost
q.add(tq, false, false);              // add to q
break;                                // ADDED!


On Fri, 2002-11-08 at 15:09, Mike Barry wrote:

Alex,

It is rather confusing. It sounds like you've indexed
a field that can be between two values (let's say
E-J) and then when you have a search term such as G
you want the docs containing E-J (or A-H or F-K but not A-H
nor A-C nor J-Z)

Just of the top of my head but could you index the upper and
lower bounds as separate fields then when you search do a
compound query:

 lower_bound:{ - search_term } AND upper_bound:{ search_term - }

just a thought.

-MikeB.


Alex Winston wrote:


I was hoping that someone could briefly review my current solution to a
problem that we have encountered, to see if anyone could suggest a
possible alternative, because as it stands we have pushed Lucene past
its current limits.

PROBLEM:

we were wanting to represent a range of values for a particular field
that is searchable over a particular range.

an example follows for clarification:
we were wanting to store a range of chapters and verses of a book for a
particular document, and in turn search to see if a query range includes
the range that is represented in the index.

if this is unclear please ask for clarification

IMPRACTICAL SOLUTION:

although this solution seems somewhat impractical it is all we could
come up with.

our solution involved storing each possible range value within the term
which would allow for RangeQuerys to be performed on this particular
field.  for very small ranges this seems somewhat practical after
profiling.  although once the field ranges began to span multiple
chapters and verses, the search times became unreasonable because we
were storing thousands of entries for each representative range.

i can elaborate on anything that is unclear,
but any thoughts on a possible alternative solution within lucene that
we overlooked would be extremely helpful.
	

alex



--
To unsubscribe, e-mail:
For additional commands, e-mail:





--
Brain: Pinky, are you pondering what I’m pondering?
Pinky: I think so, Brain, but calling it a pu-pu platter? Huh, what were 
they thinking?


--
To unsubscribe, e-mail:   mailto:lucene-user-unsubscribe;jakarta.apache.org
For additional commands, e-mail: mailto:lucene-user-help;jakarta.apache.org



Re: Your experiences with Lucene

2002-10-29 Thread Scott Ganyo
Actually, 10k isn't very large.  We have indexes with more than 1M 
records.  It hasn't been a problem.

Scott

Tim Jones wrote:

Hi,

I am currently starting work on a project that requires indexing and
searching on potentially thousands, maybe tens of thousands, of text
documents.

I'm hoping that someone has a great success story about using Lucene for
a project that required indexing and searching of a large number of
documents.
Like maybe more than 10,000. I guess what I'm trying to figure out is if
Lucene's performance will be acceptable where the number of documents is
very large.
I realize this is a very general question but I just need a general
answer.

Thanks,

Tim J.



--
Brain: Pinky, are you pondering what I’m pondering?
Pinky: I think so, Brain, but calling it a pu-pu platter? Huh, what were 
they thinking?


--
To unsubscribe, e-mail:   mailto:lucene-user-unsubscribe;jakarta.apache.org
For additional commands, e-mail: mailto:lucene-user-help;jakarta.apache.org



RE: Using Filters in Lucene

2002-07-31 Thread Scott Ganyo

Cool.  But instead of adding a new class, why not change Hits to inherit
from Filter and add the bits() method to it?  Then one could pipe the
output of one Query into another search without modifying the Queries...

Scott
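
A rough sketch of such a query-backed filter against the 1.x Filter/HitCollector
API (essentially what Doug's attached QueryFilter does; the class name here is
made up, and a real version would cache the BitSet per reader):

  import java.io.IOException;
  import java.util.BitSet;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.search.Filter;
  import org.apache.lucene.search.HitCollector;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.Query;

  // Filter whose bit set is the set of documents matching an arbitrary query.
  public class QueryBackedFilter extends Filter {
      private final Query query;

      public QueryBackedFilter(Query query) {
          this.query = query;
      }

      public BitSet bits(IndexReader reader) throws IOException {
          final BitSet bits = new BitSet(reader.maxDoc());
          IndexSearcher searcher = new IndexSearcher(reader);
          searcher.search(query, new HitCollector() {
              public void collect(int doc, float score) {
                  bits.set(doc);   // remember every document the query matched
              }
          });
          return bits;
      }
  }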

 -Original Message-
 From: Doug Cutting [mailto:[EMAIL PROTECTED]]
 Sent: Monday, July 29, 2002 12:03 PM
 To: Lucene Users List
 Subject: Re: Using Filters in Lucene
 
 
 Peter Carlson wrote:
  Would you suggest that "search in selection"-type 
 functionality use filters or
  redo the search with an AND clause?
 
 I'm not sure I fully understand the question.
 
 If you have a condition that is likely to recur commonly in subsequent 
 queries, then using a Filter which caches its bit vector is 
 much faster 
 than using an AND clause.  However, you probably cannot 
 afford to keep a 
 large number of such filters around, as the cached bit vectors use a 
 fair amount of memory--one bit per document in the index.
 
 Perhaps the ultimate filter is something like the attached class, 
 QueryFilter.  This caches the results of an arbitrary query in a bit 
 vector.  The filter can then be reused with multiple queries, and (so 
 long as the index isn't altered) that part of the query 
 computation will 
 be cached.  For example, RangeQuery could be used with this, 
 instead of 
 using DateFilter, which does not cache (yet).
 
 Caution: I have not yet tested this code.  If someone does try it, 
 please send a message to the list telling how it goes.  If this is 
 useful, I can document it better and add it to Lucene.
 
 Doug
 
 



RE: Too many open files?

2002-07-23 Thread Scott Ganyo

Are you closing the searcher after each search?

No: Waiting for the garbage collector is not a good idea.

Yes: It could be a timeout on the OS holding the file handles.

Either way, the only real option is to avoid thrashing the searchers...

Scott

 -Original Message-
 From: Hang Li [mailto:[EMAIL PROTECTED]]
 Sent: Tuesday, July 23, 2002 10:10 AM
 To: Lucene Users List
 Subject: Re: Too many open files?
 
 
  Thanks for your quick response, I still want to know why we ran out of
  file descriptors.
 
 --Yup.  Cache and reuse your Searcher as much as possible.
 
 --Scott
 
  -Original Message-
  From: Hang Li [mailto:[EMAIL PROTECTED]]
  Sent: Tuesday, July 23, 2002 9:59 AM
  To: Lucene Users List
  Subject: Too many open files?
 
 
  
 
  I have seen a lot postings about this topic. Any final thoughts?
 
  We did a simple stress test; Lucene would produce this error
  between 30 - 80
  concurrent searches.  The index directory has 24 files (15 
 fields), and
 
  
  ulimit -n
  32768
  ,
 
  there should be more than enough FDs.  Note, we did not do
  any writings to index
  while we were searching.  Any ideas? Thx.
 
 
 
 
 --
 To unsubscribe, e-mail:   
mailto:[EMAIL PROTECTED]
For additional commands, e-mail:
mailto:[EMAIL PROTECTED]



Forked files? was: RE: Too many open files?

2002-07-23 Thread Scott Ganyo

Another idea to address this (quite common) problem:

Does anyone know if there are any Java file implementations that support a
forked file or a file with multiple streams?  Or, if not, do you know of
any design patterns or documents explaining the theory and design of this
kind of thing?  It would seem that if there were an efficient implementation
of a forked file, perhaps it could be used instead of the set of files
that Lucene currently uses to represent a segment.

Scott

 -Original Message-
 From: Scott Ganyo [mailto:[EMAIL PROTECTED]]
 Sent: Tuesday, July 23, 2002 10:13 AM
 To: 'Lucene Users List'
 Subject: RE: Too many open files?
 
 
 Are you closing the searcher after each when done?
 
 No: Waiting for the garbage collector is not a good idea.
 
 Yes: It could be a timeout on the OS holding the files handles.
 
 Either way, the only real option is to avoid thrashing the 
 searchers...
 
 Scott
 
  -Original Message-
  From: Hang Li [mailto:[EMAIL PROTECTED]]
  Sent: Tuesday, July 23, 2002 10:10 AM
  To: Lucene Users List
  Subject: Re: Too many open files?
  
  
   Thanks for your quick response, I still want to know why we 
  ran out of
   file descriptors.
  
  --Yup.  Cache and reuse your Searcher as much as possible.
  
  --Scott
  
   -Original Message-
   From: Hang Li [mailto:[EMAIL PROTECTED]]
   Sent: Tuesday, July 23, 2002 9:59 AM
   To: Lucene Users List
   Subject: Too many open files?
  
  
   
  
   I have seen a lot postings about this topic. Any final thoughts?
  
   We did a simple stress test, Lucene would produce this error
   between 30 - 80
    concurrent searches.  The index directory has 24 files (15 
  fields), and
  
   
   ulimit -n
   32768
   ,
  
   there should be more than enough FDs.  Note, we did not do
   any writings to index
   while we were searching.  Any ideas? Thx.
  
  
  
  
  --
  To unsubscribe, e-mail:   
 mailto:[EMAIL PROTECTED]
 For additional commands, e-mail:
 mailto:[EMAIL PROTECTED]
 



RE: CachedSearcher

2002-07-16 Thread Scott Ganyo

I'd like to see the finalize() methods removed from Lucene entirely.  In a
system with heavy load and lots of gc, using finalize() causes problems.  To
wit:

1) I was at a talk at JavaOne last year where the gc performance experts
from Sun (the engineers actually writing the HotSpot gc) were giving
performance advice.  They specifically stated that finalize() should be
avoided if at all possible because the following steps have to happen for
finalized objects:
  a) register the object when created
  b) notice the object when it becomes unreachable
  c) finalize the object
  d) notice the object when it becomes unreachable (again)
  e) reclaim the object

This leads to the following effects in the vm:
  a) allocation is slower
  b) heap is bigger
  c) gc pauses are longer

The Sun engineers recommended that if you really do need an automatic clean-up
process, weak references are *much* more efficient and should be
used in preference to finalize().

2) External resources (i.e. file handles) are not released until the reader
is closed.  And, as many have found, Lucene eats file handles for breakfast,
lunch, and dinner.

Scott

 -Original Message-
 From: Halácsy Péter [mailto:[EMAIL PROTECTED]]
 Sent: Tuesday, July 16, 2002 12:43 AM
 To: Lucene Users List
 Subject: RE: CachedSearcher
 
 
 
 
  -Original Message-
  From: Doug Cutting [mailto:[EMAIL PROTECTED]]
  Sent: Tuesday, July 16, 2002 1:00 AM
  To: Lucene Users List
  Subject: Re: CachedSearcher
  
  
  Why is this more complicated than the code in demo/Search.jhtml 
  (included below)?  FSDirectory closes files as they're GC'd, so you 
  don't have to explicitly close the IndexReaders or Searchers.
 
  I'll check this code, but I think it could hang up with a lot 
  of opened IndexReaders.
  http://developer.java.sun.com/developer/TechTips/2000/tt0124.html
  
  (If a lot of searchers are requested and a writer is constantly 
  modifying the index.) 
 
 peter
 
 --
 To unsubscribe, e-mail:   
 mailto:[EMAIL PROTECTED]
 For additional commands, e-mail: 
 mailto:[EMAIL PROTECTED]
 



RE: CachedSearcher

2002-07-16 Thread Scott Ganyo

Point taken.  Indeed, these were general recommendations that may/may not
have a strong impact on Lucene's specific use of finalization.  My only
specific performance claim is that there will be a negative impact of some
degree using finalizers.  Whether that impact is noticable or not will
probably depend upon a number of factors.  So I will avoid making any
further judgements on the impact of finalization in Lucene on the
performance until I have proof.

Benchmarks aside, my point on the file handles is something that hit us
square between the eyes.  Before we started caching and explicitly closing
our Searchers we would regularly run out of file handles because of Lucene.
This was despite increasing our allocated file handles to ludicrous levels
in the OS.  I would recommend that, in general, Java developers would be
well advised to explicitly release external resources when done with them
rather than allowing finalization to take care of it.

Scott
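
The explicit-release pattern is nothing more than a try/finally around the
resource, e.g. (the path is made up):

  import org.apache.lucene.index.IndexReader;

  IndexReader reader = IndexReader.open("/path/to/index");
  try {
      // ... share the reader/searcher across a whole batch of requests ...
  } finally {
      reader.close();   // release the file handles now, not whenever finalize() runs
  }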

 -Original Message-
 From: Doug Cutting [mailto:[EMAIL PROTECTED]]
 Sent: Tuesday, July 16, 2002 11:56 AM
 To: Lucene Users List
 Subject: Re: CachedSearcher
 
 
 Scott Ganyo wrote:
  I'd like to see the finalize() methods removed from Lucene 
 entirely.  In a
  system with heavy load and lots of gc, using finalize() 
 causes problems.
   [ ... ]
   External resources (i.e. file handles) are not released 
 until the reader
  is closed.  And, as many have found, Lucene eats file 
 handles for breakfast,
  lunch, and dinner.
 
 Lucene does open and close lots of files relative to many 
 other applications, 
 but the number of files opened is still many orders of 
 magnitude less than the 
 number of other objects allocated.  I would be very surprised 
 if finalizers for 
 the hundreds of files that Lucene might open in a session 
 would have any 
 measurable impact on garbage collector performance given the 
 millions of other 
 objects that the garbage collector might process in that session.
 
 As usual, one should not make performance claims without 
 performing benchmarks. 
   It would be a simple matter to comment out the finalize() 
 methods, recompile 
 and compare indexing and search speed.  If the improvement is 
 significant, then 
 we can consider removing finalize methods.
 
 Doug
 
 
 --
 To unsubscribe, e-mail:   
mailto:[EMAIL PROTECTED]
For additional commands, e-mail:
mailto:[EMAIL PROTECTED]



RE: IndexReader Pool

2002-07-08 Thread Scott Ganyo

Deadlocks could be created if the order in which locks are obtained is not
consistent.  Note, though, that the locks are obtained in the same order
each time throughout.  (BTW: The inner lock is merely needed because the
wait/notify calls need to own the monitor.)

Naturally, you are free to make any suggestions for improvement! :)

Scott

 -Original Message-
 From: Ilya Khandamirov [mailto:[EMAIL PROTECTED]]
 Sent: Saturday, July 06, 2002 11:24 AM
 To: 'Lucene Users List'
 Subject: RE: IndexReader Pool
 
 
 You are correct.  Actually, there have been a few bug fixes 
 since that
 was posted.
 Here's a diff to an updated version:
 
  Well, I do not see your actual version of this file, but it looks like
  you now have two synchronized blocks:
  
  synchronized ( sync )
    ...
  synchronized ( info )
  
  This may produce deadlocks in a multithreading environment. Have you
  already solved this problem, or should I take a closer look at it?
 
 
 Hope it helps,
 
 Sure. Thank you.
 
 
 Scott
 
 Regards,
 Ilya
 
 
 
 --
 To unsubscribe, e-mail:   
 mailto:[EMAIL PROTECTED]
 For additional commands, e-mail: 
 mailto:[EMAIL PROTECTED]
 



RE: Stress Testing Lucene

2002-06-27 Thread Scott Ganyo

Which came first--the out-of-file-handles error or the corruption?  I
haven't looked, but I would guess that if you ran into the file handles
exception while writing, that might leave Lucene in a bad state.  Lucene
isn't transactional and doesn't really have the ACID properties of a
database...

 -Original Message-
 From: Nader S. Henein [mailto:[EMAIL PROTECTED]]
 Sent: Wednesday, June 26, 2002 11:45 PM
 To: Lucene Users List
 Subject: RE: Stress Testing Lucene
 
 
  I rebooted my machine and still have the same issue. If I knew
  what caused it to happen, I would be able to solve it with
  some source tweaking, and it's not the file handles on the machine; I
  got over that problem months ago. Let's consider the worst-case 
  scenario and
  assume
  corruption did occur: what could be the reasons? I'm going to need some
  insider
  help to get through this one.
 
 N.
 
 -Original Message-
 From: Scott Ganyo [mailto:[EMAIL PROTECTED]]
 Sent: Wednesday, June 26, 2002 7:15 PM
 To: 'Lucene Users List'
 Subject: RE: Stress Testing Lucene
 
 
 1) Are you sure that the index is corrupted?  Maybe the file 
 handles just
 haven't been released yet.  Did you try to reboot and try again?
 
 2) To avoid the too-many files problem: a) increase the 
 system file handle
 limits, b) make sure that you reuse IndexReaders as much as 
 you can rather
 across requests and client rather than opening and closing them.
 
  -Original Message-
  From: Nader S. Henein [mailto:[EMAIL PROTECTED]]
  Sent: Wednesday, June 26, 2002 10:11 AM
  To: [EMAIL PROTECTED]
  Subject: Stress Testing Lucene
  Importance: High
 
 
 
  Hey people,
 
  I'm running a Lucene (v1.2) servlet on resin and I must say
  compared to
  Oracle Intermedia
  it's working beautifully. BUT today, I started stress testing and I
  downloaded a program called
  Web Roller, which simulates clients, requests,
  multi-threading... the works
  and I was testing
  I was doing something like 50 simultaneous requests and I was
  repeating that
  10 times in a row.
 
  but then something happened and the index got corrupted,
  every time I try
  opening the index
  with the reader to search or open with the writer to optimize
  I get that
  damned too-many files
  open error. I can imagine that every application on the market has a
  breaking point and these breaking
  points have side effects, so is the corruption of the index a
  side effect
  and if so is there a way that
  I configure my web server to crash before the corruption
  occurs, I'd rather
  re-start the web server and
  throw some people off wack rather that have to re-build the
  index or revert
  to an older version.
 
  Do you know of any way to safeguard against this ?
 
  General Info:
  The index is about 45 MB with 60 000 XML files each
  containing 18-25 fields.
 
 
  Nader S. Henein
  Bayt.com , Dubai Internet City
  Tel. +9714 3911900
  Fax. +9714 3911915
  GSM. +9715 05659557
  www.bayt.com
 
 
  --
  To unsubscribe, e-mail:
  mailto:[EMAIL PROTECTED]
  For additional commands, e-mail:
  mailto:[EMAIL PROTECTED]
 
 
 
 --
 To unsubscribe, e-mail:   
 mailto:[EMAIL PROTECTED]
 For additional commands, e-mail: 
 mailto:[EMAIL PROTECTED]
 



RE: Stress Testing Lucene

2002-06-26 Thread Scott Ganyo

1) Are you sure that the index is corrupted?  Maybe the file handles just
haven't been released yet.  Did you try to reboot and try again?

2) To avoid the too-many files problem: a) increase the system file handle
limits, b) make sure that you reuse IndexReaders as much as you can rather
across requests and client rather than opening and closing them.

 -Original Message-
 From: Nader S. Henein [mailto:[EMAIL PROTECTED]]
 Sent: Wednesday, June 26, 2002 10:11 AM
 To: [EMAIL PROTECTED]
 Subject: Stress Testing Lucene
 Importance: High
 
 
 
 Hey people,
 
 I'm running a Lucene (v1.2) servlet on resin and I must say 
 compared to
 Oracle Intermedia
 it's working beautifully. BUT today, I started stress testing and I
 downloaded a program called
 Web Roller, witch simulates clients, requests , 
 multi-threading .. the works
 and I was testing
 I was doing something like 50 simultaneous requests and I was 
 repeating that
 10 times in a row.
 
 but then something happened and the index got corrupted, 
 every time I try
 opening the index
 with the reader to search or open with the writer to optimize 
 I get that
 damned too-many files
 open error. I can imagine that every application on the market has a
 breaking point and these breaking
 points have side effects, so is the corruption of the index a 
 side effect
 and if so is there a way that
 I configure my web server to crash before the corruption 
 occurs, I'd rather
 re-start the web server and
 throw some people off wack rather that have to re-build the 
 index or revert
 to an older version.
 
 Do you know of any way to safeguard against this ?
 
 General Info:
 The index is about 45 MB with 60 000 XML files each 
 containing 18-25 fields.
 
 
 Nader S. Henein
 Bayt.com , Dubai Internet City
 Tel. +9714 3911900
 Fax. +9714 3911915
 GSM. +9715 05659557
 www.bayt.com
 
 
 --
 To unsubscribe, e-mail:   
 mailto:[EMAIL PROTECTED]
 For additional commands, e-mail: 
 mailto:[EMAIL PROTECTED]
 



RE: Boolean Query + Memory Monster

2002-06-13 Thread Scott Ganyo

Use the java -Xmx option (e.g. java -Xmx512m) to increase your heap size.

Scott

 -Original Message-
 From: Nader S. Henein [mailto:[EMAIL PROTECTED]]
 Sent: Thursday, June 13, 2002 12:20 PM
 To: [EMAIL PROTECTED]
 Subject: Boolean Query + Memory Monster
 
 
 
  I have 1 GB of memory on the machine running the application. 
  When I use a normal query it goes well, but when I use a range 
  query it sucks the memory out of the machine and throws a servlet 
  out-of-memory error. 
  I have 80,000 records in the index and it is 43 MB large.
 
 anything people ?
 
 
 Nader S. Henein
 Bayt.com , Dubai Internet City
 Tel. +9714 3911900
 Fax. +9714 3911915
 GSM. +9715 05659557
 www.bayt.com
 
 --
 To unsubscribe, e-mail:   
 mailto:[EMAIL PROTECTED]
 For additional commands, e-mail: 
 mailto:[EMAIL PROTECTED]
 



RE: Queryparser croaking on [ and ]

2002-02-20 Thread Scott Ganyo

Actually, [] denotes an inclusive range of Terms.  Anyway, why not change
the syntax if this is bad...?

Scott

 -Original Message-
 From: Brian Goetz [mailto:[EMAIL PROTECTED]]
 Sent: Wednesday, February 20, 2002 10:08 AM
 To: Lucene Users List
 Subject: Re: Queryparser croaking on [ and ]
 
 
 This is because the query parser uses [] to denote ranges of numbers.
 (I always thought this was a bad choice of syntax for exactly this
 reason.)
 
 
 On Wed, Feb 20, 2002 at 11:14:05AM -, Les Hughes wrote:
  Hi,
  
   I'm currently building a small app that allows searching of 
  Java source code.
   The problem I'm getting is that when parsing a query string that 
  contains an
   array specifier (i.e. String[] or int[][]) the query parser 
  seems to croak
  with a
  
  Lexical error at line XX, column XX. Encountered:   after : []
  
  
  So what am I doing wrong / what should I write to fix this?
  
  
  Les
  
  
 



RE: JDK 1.1 vs 1.2+

2002-01-22 Thread Scott Ganyo

+1

 -Original Message-
 From: Matt Tucker [mailto:[EMAIL PROTECTED]]
 Sent: Tuesday, January 22, 2002 11:06 AM
 To: 'Lucene Users List'
 Subject: RE: JDK 1.1 vs 1.2+
 
 
 Hey all,
 
 I'd just like to chime in with support for dropping JDK 1.1, 
 especially if it
 would aid i18n in Lucene. There just doesn't seem to be a compelling
 reason to build anything for JDK 1.1 anymore.
 
 Regards,
 Matt
 Jive Software
 
  -Original Message-
  From: Andrew C. Oliver [mailto:[EMAIL PROTECTED]] 
  Sent: Tuesday, January 22, 2002 10:52 AM
  To: Lucene Users List
  Subject: JDK 1.1 vs 1.2+
  
  
  Hello everyone,
  
  I originally posted this question to the developers list, but 
  was asked to repeat it here.
  
  I'm working on some new functionality I plan to submit for 
  Lucene.  In doing this I've noticed that Lucene currently 
  maintains compatibility with JDK 1.1.  This has some 
  disadvantages, for instance the use of Vector versus some of 
  the newer collections.  Next, some of the functionality I plan 
  to add requires JDK 1.2.  Finally, some of the 
  internationalization features of Java do not work well in 
  1.1.  For these reasons I suggest a move to 1.2+.  While it 
  seems reasonable to me to drop support for a 4 year old 
  version of the JDK, I realize it may still present a problem 
  to some users and would like to raise a discussion on this.
  
  How many people are still using 1.1 and would be negatively 
  affected by Lucene's use of 1.2 features?  Of those, how many 
  people can not move to 1.2 for server side development?
  
  -Andy
  -- 
  www.superlinksoftware.com
  www.sourceforge.net/projects/poi - port of Excel format to 
  java 
  http://developer.java.sun.com/developer/bugParade/bugs/4487555
 .html 
   - fix java generics!
 
 
 The avalanche has already started. It is too late for the pebbles to
 vote. -Ambassador Kosh
 
 
 



Re: Industry Use of Lucene?

2001-12-06 Thread Scott Ganyo

We use Lucene extensively as a core part of our ASP product here at
eTapestry.  In fact, we've built our database query engine on top of
it.  We have been extremely pleased with the results.

Scott

Jeff Kunkle asks:
 Does anyone know of any companies or agencies using Lucene for their
 products/projects?  I am attempting to make a marketing pitch for
 Lucene to my manager and I know one of the first questions will be,
 Who else is using it?  I know Lucene is a very powerful, fast, and
 flexible full-text search engine but my manager will need a little
 more coercing.  Any help on this topic is greatly appreciated.







RE: Problems with prohibited BooleanQueries

2001-11-02 Thread Scott Ganyo

I don't use a query parser at all, so that's no issue.  I just need a
BooleanQuery to realize that it only has negative clauses and do the right
thing.  Right now I have to include a bogus static field in every single
document so that I can use a TermQuery on that bogus field as the left side
of a BooleanQuery subtract.  Sure, it works, but it ain't pretty...
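
For the archives, the workaround looks roughly like this (sketch only,
untested; the field names are made up, and it assumes the
add(Query, required, prohibited) form of BooleanQuery.add from this era):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class NegativeOnlyQuery
{
    public static void main(String[] args)
    {
        // "exists"/"true" is the bogus field written into every document at
        // index time; "status"/"deleted" stands in for the term being negated.
        BooleanQuery query = new BooleanQuery();
        query.add(new TermQuery(new Term("exists", "true")), true, false);     // required
        query.add(new TermQuery(new Term("status", "deleted")), false, true);  // prohibited
        System.out.println(query.toString("contents"));
    }
}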

Scott

 -Original Message-
 From: Doug Cutting [mailto:[EMAIL PROTECTED]]
 Sent: Thursday, November 01, 2001 10:49 AM
 To: 'Lucene Users List'
 Subject: RE: Problems with prohibited BooleanQueries
 
 
  From: Scott Ganyo [mailto:[EMAIL PROTECTED]]
  
  How difficult would it be to get BooleanQuery to do a 
  standalone NOT, do you
  suppose?  That would be very useful in my case.
 
 It would not be that difficult, but it would make queries 
 slow.  All documents
 not containing the term would need to be enumerated.  Since 
 most terms occur
 in only a small percentage of the documents, most NOT queries 
 would return
 most documents.
 
 Scoring would also be strange.  I guess you'd give them all a 
 score of 1.0,
 and hope that the query is nested in a more complex query that will
 differentiate the scores.  But if it's nested, then you could 
 do it with
 BooleanQuery as it stands...
 
 So, my question to you is: do you actually want lists of all 
 documents that
 do not contain a term, or, rather, do you want to use negation in the
 context of other query terms, and are having trouble getting 
 your query
 parser to build BooleanQueries?
 
 Doug
 



RE: File Handles issue

2001-10-16 Thread Scott Ganyo

  P.S. At one point I tried doing an in-memory index using the 
  RAMDirectory
  and then merging it with an on-disk index and it didn't work.  The
  RAMDirectory never flushed to disk... leaving me with an 
  empty index.  I
  think this is because of a bug in the mechanism that is 
  supposed to copy the
  segments during the merge, but I didn't follow up on this.
 
 That should work, it should be faster and would use a lot 
 less memory than
 the approach you describe above.  Can you please submit a 
 simple test case
 illustrating the failure?  Something self-contained would be best.

Ok.  This will fail:

import java.io.*;
import org.apache.lucene.index.*;
import org.apache.lucene.analysis.*;
import org.apache.lucene.document.*;
import org.apache.lucene.store.*;

public class LuceneRAMDirectoryTest
{
    public static void main(String args[])
    {
        try
        {
            // create the index in RAM
            RAMDirectory ramDirectory = new RAMDirectory();
            Analyzer analyzer = new SimpleAnalyzer();
            IndexWriter ramWriter = new IndexWriter(ramDirectory, analyzer, true);
            try
            {
                for (int i = 0; i < 100; i++)
                {
                    Document doc = new Document();
                    doc.add(Field.Keyword("field1", "" + i));
                    ramWriter.addDocument(doc);
                }
            }
            finally
            {
                ramWriter.close();
            }

            // then merge into the file-based index
            File file = new File("index");
            boolean missing = !file.exists();
            if (missing) file.mkdir();
            IndexWriter fileWriter = new IndexWriter(file, analyzer, true);
            try
            {
                fileWriter.addIndexes(new Directory[] { ramDirectory });
            }
            finally
            {
                fileWriter.close();
            }
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
    }
}



RE: Trying To Understand Query Syntax Details

2001-10-16 Thread Scott Ganyo

Not sure about the rest, but if you've stored your dates in yyyymmdd format,
you can use a RangeQuery like so:

dateField:[20011001-null]

This would return all dates on or after October 1, 2001.
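
On the indexing side that just means formatting the date before storing it,
along these lines (untested sketch; "dateField" is only an example name):

import java.text.SimpleDateFormat;
import java.util.Date;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class DateKeywordExample
{
    public static void main(String[] args)
    {
        // yyyymmdd keeps string order identical to chronological order, so a
        // RangeQuery over the keyword behaves like a date comparison.
        SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMdd");
        Document doc = new Document();
        doc.add(Field.Keyword("dateField", fmt.format(new Date())));
        System.out.println(doc);
    }
}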

Scott

 -Original Message-
 From: W. Eliot Kimber [mailto:[EMAIL PROTECTED]]
 Sent: Tuesday, October 16, 2001 11:10 AM
 To: lucene-user
 Subject: Trying To Understand Query Syntax Details
 
 
 I'm trying to understand the details of the query syntax. I found the
 syntax in QueryParser.jj, but it doesn't make everything clear.
 
 My initial questions:
 
 - It doesn't appear that ? can be the last character in a 
 search. For
 example, to match fool and food, I tried to do foo?, but got a
 parse error. fo?l of course matches fool and foal. Is this 
 a bug or an
 implementation constraint?
 
 - How does one specify a date range in a query? We need to be able to
 search on docs later than date x, and I know that Lucene 
 supports date
 matching, but I don't see how to specify this in a query.
 
 Also, is there a description of the algorithm ~ uses?
 
 Thanks,
 
 E.
 
 -- 
 . . . . . . . . . . . . . . . . . . . . . . . .
 
 W. Eliot Kimber | Lead Brain
 
 1016 La Posada Dr. | Suite 240 | Austin TX  78752
 T 512.656.4139 |  F 512.419.1860 | [EMAIL PROTECTED]
 
 w w w . d a t a c h a n n e l . c o m
 



RE: File Handles issue

2001-10-15 Thread Scott Ganyo

Thanks for the detailed information, Doug!  That helps a lot.

Based on what you've said and on taking a closer look at the code, it looks
like by setting mergeFactor and maxMergeDocs to Integer.MAX_VALUE, an entire
index will be built in a single segment completely in memory (using the
RAMDirectory) and then flushed to disk when closed.  Given enough memory, it
would seem that this would be the fastest setting (as well as using a
minimum of file handles).  Would you agree?
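
Concretely, I'm picturing something like this (sketch only, untested; whether
maxMergeDocs is exposed as a public field depends on the Lucene version in
use):

import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class InMemoryBuild
{
    public static void main(String[] args) throws Exception
    {
        IndexWriter writer = new IndexWriter("index", new SimpleAnalyzer(), true);
        writer.mergeFactor = Integer.MAX_VALUE;   // never trigger intermediate merges
        writer.maxMergeDocs = Integer.MAX_VALUE;  // no cap on merged segment size
        // ... addDocument() calls would go here ...
        writer.close();                           // the single segment is flushed to disk
    }
}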

Thanks,
Scott

P.S. At one point I tried doing an in-memory index using the RAMDirectory
and then merging it with an on-disk index and it didn't work.  The
RAMDirectory never flushed to disk... leaving me with an empty index.  I
think this is because of a bug in the mechanism that is supposed to copy the
segments during the merge, but I didn't follow up on this.



File Handles issue

2001-10-11 Thread Scott Ganyo

We're having a heck of a time with too many file handles around here.  When
we create large indexes, we often get thousands of temporary files in a
given index!  Even worse, we just plain run out of file handles--even on
boxes where we've upped the limits as much as we think we can!  We've played
around with various settings for the mergeFactor and maxMergeDocs, but these
seem to have at best an indirect effect on the number of temporary files
created.

I'm not very familiar with the Lucene file system yet, so can someone
briefly explain how Lucene works on creating an index?  How does it
determine when to create a new temporary file in the index and when does it
decide to compress the index?  Also, is there any way we could limit the
number of file handles used by Lucene?

This is becoming a huge problem for us, so any insight would be appreciated.

Thanks,
Scott