Re: OutOfMemoryError with Lucene 1.4 final

2004-12-10 Thread Justin Swanhart
You probably need to increase the amount of RAM available to your JVM.  

See the parameters:
-Xmx   :Maximum memory usable by the JVM
-Xms   :Initial memory allocated to JVM

My params are: -Xmx2048m -Xms128m  (2G max, 128M initial)


On Fri, 10 Dec 2004 11:17:29 -0600, Sildy Augustine
[EMAIL PROTECTED] wrote:
 I think you should close your files in a finally clause in case of
 exceptions from the file system, and also print out the exception.
 
 You could be running out of file handles.
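 A minimal sketch of that pattern (purely illustrative; f stands for the
 file being indexed):
 
 BufferedReader reader = null;
 try {
     reader = new BufferedReader(new InputStreamReader(new FileInputStream(f)));
     // ... read the file and add its contents to the Document ...
 } catch (java.io.IOException e) {
     e.printStackTrace();   // at least report what went wrong
 } finally {
     if (reader != null) {
         try { reader.close(); } catch (java.io.IOException ignore) {}
     }
 }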
 
 
 
 -Original Message-
 From: Jin, Ying [mailto:[EMAIL PROTECTED]
 Sent: Friday, December 10, 2004 11:15 AM
 To: [EMAIL PROTECTED]
 Subject: OutOfMemoryError with Lucene 1.4 final
 
 Hi, Everyone,
 
 We're trying to index ~1500 archives but get an OutOfMemoryError about
 halfway through the indexing process. I've tried to run the program on two
 different Redhat Linux servers: one with 256M memory and 365M swap
 space, the other with 512M memory and 1G swap space. However, both
 got the OutOfMemoryError at the same place (at record 898).
 
 Here is my code for indexing:
 
 ===
 
 Document doc = new Document();
 
 doc.add(Field.UnIndexed("path", f.getPath()));
 doc.add(Field.Keyword("modified",
     DateField.timeToString(f.lastModified())));
 doc.add(Field.UnIndexed("eprintid", id));
 doc.add(Field.Text("metadata", metadata));
 
 FileInputStream is = new FileInputStream(f);  // the text file
 BufferedReader reader = new BufferedReader(new InputStreamReader(is));
 
 StringBuffer stringBuffer = new StringBuffer();
 String line = "";
 
 try {
     while ((line = reader.readLine()) != null) {
         stringBuffer.append(line);
     }
 
     doc.add(Field.Text("contents", stringBuffer.toString()));
 
     // release the resources
     is.close();
     reader.close();
 } catch (java.io.IOException e) {}
 
 =
 
 Is there anything wrong with my code, or do I need more memory?
 
 Thanks for any help!
 
 Ying
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: partial updating of lucene

2004-12-08 Thread Justin Swanhart
Your unstored fields were not stored in the index; only their terms
were indexed.  When you get the document from the index, modify it, and
add it again, the content of those unstored fields is lost.

You can either create a brand-new document, populate all of its
fields, and add that document to the index, or you can re-add the
unstored fields to the document retrieved in step 1.
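A rough sketch of the delete-then-re-add approach, assuming a unique "id"
keyword field; the index path and the docId/newTitle/description values are
placeholders supplied by your application:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

// 1) remove the old version of the document
IndexReader reader = IndexReader.open(indexPath);
reader.delete(new Term("id", docId));
reader.close();

// 2) rebuild the document with *all* of its fields, including unstored ones
Document doc = new Document();
doc.add(Field.Keyword("id", docId));
doc.add(Field.Text("title", newTitle));
doc.add(Field.UnStored("description", description));  // must be re-supplied

// 3) add it back
IndexWriter writer = new IndexWriter(indexPath, new StandardAnalyzer(), false);
writer.addDocument(doc);
writer.close();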


On Wed, 8 Dec 2004 17:53:26 -0500, Praveen Peddi
[EMAIL PROTECTED] wrote:
 Hi all,
 I have a question about updating the lucene document. I know that there is no 
 API to do that now. So this is what I am doing in order to update the 
 document with the field title.
 
 1) Get the document from lucene index
 2) Remove a field called title and add the same field with a modified value
 3) Remove the document (based on one of our fields) using an IndexReader and then 
 close the reader.
 4) Add the document that is obtained in 1 and modified in 2.
 
 I am not sure if this is the right way of doing it, but I am having problems 
 searching for that document after updating it. The problem is only with the 
 unstored fields.
 
 For example, I search for description:boy where description is an unstored, 
 indexed, tokenized field in the document. I find 1 document. Now I update the 
 document's title as described above and repeat the same search 
 description:boy, and now I don't find any results. I have not touched the 
 field description at all. I just updated the field title.
 
 Is this expected behaviour? If not, is it a bug?
 
 If I change the field description as stored, indexed and tokenized, the 
 search works fine before and after updating.
 
 Praveen
 **
 Praveen Peddi
 Sr Software Engg, Context Media, Inc.
 email:[EMAIL PROTECTED]
 Tel:  401.854.3475
 Fax:  401.861.3596
 web: http://www.contextmedia.com
 **
 Context Media- The Leader in Enterprise Content Integration
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



corrupted index

2004-12-04 Thread Justin Swanhart
Somehow today one of my indexes became corrupted.  

I get the following IO exception when trying to open the index:
Exception in thread "main" java.io.IOException: read past EOF
at org.en.lucene.store.InputStream.refill(InputStream.java:154)
at org.en.lucene.store.InputStream.readByte(InputStream.java:43)
at org.en.lucene.store.InputStream.readVInt(InputStream.java:83)
at org.en.lucene.index.FieldInfos.read(FieldInfos.java:195)
at org.en.lucene.index.FieldInfos.init(FieldInfos.java:55)
at org.en.lucene.index.SegmentReader.initialize(SegmentReader.java:109)
at org.en.lucene.index.SegmentReader.init(SegmentReader.java:94)
at org.en.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:480)
at 
org.en.lucene.index.IndexWriter.maybeMergeSegments(IndexWriter.java:458)
at org.en.lucene.index.IndexWriter.addDocument(IndexWriter.java:310)
at org.en.lucene.index.IndexWriter.addDocument(IndexWriter.java:294)
at org.en.global.indexer2.Minnow.main(Minnow.java:142)

Any ideas on what could cause this type of corruption, and what I can
do to avoid it in the future?  Also, any ideas on repairing the index
if this happens?  I removed the index directory and marked the rows to
be reindexed in the database, but the data is unavailable to my
users while the index rebuilds.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Thread safety

2004-12-03 Thread Justin Swanhart
You can only have one open writer at a time.  A writer is either an
IndexWriter object, or an IndexReader object that has modified the
index, for instance by deleting documents.

You must close your existing writer before you open a new one.

You should not get lock exceptions with IndexSearchers.  The only time
the locks come into play is when you try to open a writer while another
writer process already holds the lock, or when a writer process died
without removing the lock, leaving a stale lock behind.

I've run into FileNotFound exceptions on occasion, and have pretty
much pinned them down to modifying a very large index on a slow device
(NFS) while trying to instantiate a new searcher.  I worked around it
by catching the exception and creating the searcher again, and that
resolved the problem for me.
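A minimal sketch of that retry (the index path and the number of attempts
are arbitrary placeholders):

IndexSearcher searcher = null;
for (int attempt = 0; attempt < 3 && searcher == null; attempt++) {
    try {
        searcher = new IndexSearcher(new RAMDirectory(indexPath));
    } catch (IOException e) {
        // a concurrent merge probably removed a segment file; wait and retry
        try { Thread.sleep(1000); } catch (InterruptedException ignore) {}
    }
}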


On Fri, 03 Dec 2004 08:58:41 +0100, sergiu gordea
[EMAIL PROTECTED] wrote:
 Otis Gospodnetic wrote:
 
 1. yes
 2. yes error, meaningful, it depends what you find meaningful :)
 3. searcher will still find the document, unless you close it and
 reopen it (searcher)
 
 
 ... What about LockExceptions? I tried to index objects in a thread and
 to use an IndexSearcher
 to search objects, but I had problems with this.
 I tried to create a new IndexSearcher object if the index version was
 changed, but unfortunately
 I got some LockExceptions and FileNotFoundExceptions.
 
  If answer number 3 is correct, then why did I get these exceptions?
 
  Sergiu
 
 
 
 Otis
 
 --- Zhang, Lisheng [EMAIL PROTECTED] wrote:
 
 
 
 Hi,
 
 I have an urgent question about thread safety in lucene,
 from lucene doc and code I could not get a clear answer.
 
 1. is Searcher (IndexSearcher, MultiSearcher ..) thread
 safe, can multi-users call search(..) method on the
 same object at the same time?
 
 2. if on the same object, one user calls close( ) and
 another calls search(..), I assume we should have a
 meaningful error message?
 
 3. what would happen if one user calls Searcher.search(..),
 but at the same time another user tries to delete that
 document from index files by calling IndexReader.delete(..)
 (either through two threads or two separate processes)?
 
 A brief answer would be good enough for me now, thanks
 very much in advance!
 
 Lisheng
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 
 
 
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: What is the best file system for Lucene?

2004-11-30 Thread Justin Swanhart
On Tue, 30 Nov 2004 12:07:46 -, Pete Lewis [EMAIL PROTECTED] wrote:
 Also, unless you take your hyperthreading off, with just one index you are
 searching with just one half of the CPU - so your desktop is actually using
 a 1.5GHz CPU for the search.  So, taking account of this its not too
 surprising that they are searching at comparable speeds.
 
 HTH
 Pete

Actually, that isn't how hyperthreading works.  The second CPU in a
hyperthreaded system should only run threads when the main CPU is
waiting on another task, like a memory access.  The second, or sub, CPU
is only a virtual processor; there aren't really two chips on board.
New multicore processors will actually have more than one processor
in one chip.

Problems can arise when you are using an HT processor on an operating
system that doesn't know about HT technology.  The OS should only
schedule jobs to run on the sub CPU under very specific circumstances.
This is one of the major reasons for the scheduler overhaul in Linux
2.6.  The default scheduler in 2.4 would assign threads to the sub CPU
that should not have been assigned there, and those threads would
suffer from resource starvation.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: What is the best file system for Lucene?

2004-11-30 Thread Justin Swanhart
As a generalisation, SuSE itself is not a lot slower than Windows XP.
I also very much doubt that the filesystem is a factor.  If you want to
test without filesystem involvement, simply load your index into a
RAMDirectory instead of using an FSDirectory.  That removes filesystem
overhead from searches.
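For example, something along these lines (the index path is just a
placeholder; assumes the usual org.apache.lucene imports):

// load the whole index from disk into memory, then search it from there
RAMDirectory ramDir = new RAMDirectory("/path/to/index");
IndexSearcher searcher = new IndexSearcher(ramDir);

// versus the default, which goes through the filesystem on every search:
// IndexSearcher searcher = new IndexSearcher("/path/to/index");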

There are quite a number of factors involved that could be affecting
performance.

First off, 1.8GHz Pentium-M machines are supposed to run at about the
speed of a 2.4GHz desktop machine.  The clock speeds on the mobile chips are
lower, but they tend to perform much better than their rating suggests.  I
recommend you take a general benchmark of both machines, testing both disk
speed and CPU speed, to get a baseline performance comparison.  I also
suggest turning off HT for your benchmarks and performance testing.

Secondly, while the second machine appears to be twice as fast, the
disk could actually perform worse on the Linux box, especially if the
notebook drive has a big (8M) cache like most 7200RPM ATA disk drives
do.  I imagine that if you hit the index with lots of simultaneous
searches, the Linux box would hold its own for much longer than
the XP box, simply due to the random seek performance of the SCSI disk
combined with SCSI command queueing.

RAM speed is a factor too.  Is the p4 a xeon processor?  The older HT
xeons have a much slower bus than the newer p4-m processors.  Memory
speed will be affected accordingly.

I haven't heard of a hard disk referred to as a winchester disk in a
very long time :)

Once you have an idea of how the two machines actually compare
performance-wise, you can then judge how they perform index
operations.  Until then, all your measurements are subjective and you
don't gain much by comparing the two indexing processes.

Justin

On Tue, 30 Nov 2004 02:04:46 -0800 (PST), Sanyi [EMAIL PROTECTED] wrote:
 Hi!
 
 I'm testing Lucene 1.4.2 on two very different configs, but with the same 
 index.
 I'm very surprised by the results: Both systems are searching at about the 
 same speed, but I'd
 expect (and I really need) to run Lucene a lot faster on my stronger config.
 
 Config #1 (a notebook):
 WinXP Pro, NTFS, 1.8GHz Pentium-M, 768Megs memory, 7200RPM winchester
 
 Config #2 (a desktop PC):
 SuSE 9.1 Pro, reiserfs, 3.0GHz P4 HT (virtually two 3.0GHz P4s), 3GByte RAM, 
 15000RPM U320 SCSI
 winchester
 
 You can see that the hardware of #2 is at least twice as good/fast as #1.
 I'm searching for the reason, and for a solution, to take advantage of the better 
 hardware compared to the
 poor notebook.
 Currently #2 doesn't dramatically outperform the notebook (#1).
 
 The question is: What can be worse in #2 than on the poor notebook?
 
 I can imagine only software problems.
 Which are the software parts then?
 1. The OS
 Is SuSE 9.1 a LOT slower than WinXP Pro?
 2. The file system
 Is reiserfs a LOT slower than NTFS?
 
 Regards,
 Sanyi
 
 __
 Do you Yahoo!?
 Yahoo! Mail - You care about security. So do we.
 http://promotions.yahoo.com/new_mail
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Index in RAM - is it realy worthy?

2004-11-28 Thread Justin Swanhart
My indexes are stored on a NetApp filer via NFS.  The indexer process
updates the indexes over NFS.  I have multiple indexes.  My search
process determines whether the NFS indexes have been updated, and if they
have, it loads each index into a RAMDirectory.  RAMDirectory is of
course much faster than searching over NFS.  This way, I can also run
multiple search servers easily.  The drawback of course is
startup time.  It takes a few minutes to start each search server
because it has to load the data into memory.  RAMDirectory also seems
to be somewhat memory inefficient, using a lot more memory than the
data actually consumes on disk.
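A condensed sketch of that check-and-reload logic, with hypothetical names
(the real code also keeps one searcher per index and swaps them atomically):

void maybeReload() throws IOException {
    long current = IndexReader.getCurrentVersion(nfsIndexPath);
    if (current != lastVersion) {
        // the indexer updated the index on the filer: pull a fresh copy into RAM
        IndexSearcher fresh = new IndexSearcher(new RAMDirectory(nfsIndexPath));
        IndexSearcher old = searcher;
        searcher = fresh;          // swap in the new searcher
        lastVersion = current;
        if (old != null) {
            old.close();
        }
    }
}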


On Wed, 24 Nov 2004 14:26:40 -0800, Jonathan Hager [EMAIL PROTECTED] wrote:
 When comparing RAMDirectory and FSDirectory it is important to mention
 which OS you are using.  When using Linux, it will cache the most recently
 accessed disk blocks in memory.  Here is a good article that describes its
 strategy: http://forums.gentoo.org/viewtopic.php?t=175419
 
 The 2% difference you are seeing is the memory copy.  With other OSes
 you may see a speed up when using the RAMDirectory, because not all
 OSes contain a disk cache in memory and must access the disk to read
 the index.
 
 Another consideration is that there is currently a 2GB limitation on the
 size of a RAMDirectory.  Indexes over 2GB cause an overflow in the
 int used to create the buffer.  [see "int len = (int) is.length();" in
 RAMDirectory]
 
 I ended up using RAM directory for a very different reason.  The index
 is 1 to 2MB and is rebuilt every few hours.  It takes 3 to 4 minutes
 to query the database and rebuild the index.  But the search should be
 available 100% of the time.  Since the index is so small I do the
 following:
 
 on server startup:
 - look for semaphore, if it is there delete the index
 - if there is no index, build it to FSdirectory
 - load the index from FSDirectory into RAMDirectory
 
 on reindex:
 - create semaphore
 - rebuild index to FSDirectory
 - delete semaphore
 - load index from FSDirecttory into RAMDirectory
 
 to search:
 - search the RAMDirectory
 
 RAMDirectory could be replaced by a regular FSDirectory, but it seemed
 silly to copy the index from disk to disk, when it ultimately needs to
 be in memory.
 
 FSDirectory could be replaced by a RAMDirectory, but this means that
 it would take the server 3 to 4 minutes longer to startup every time.
 By persisting the index, this time would only be necessary if indexing
 was interrupted.
 
 Jonathan
 
 On Mon, 22 Nov 2004 12:39:07 -0800, Kevin A. Burton
 
 
 [EMAIL PROTECTED] wrote:
  Otis Gospodnetic wrote:
 
  For the Lucene book I wrote some test cases that compare FSDirectory
  and RAMDirectory.  What I found was that with certain settings
  FSDirectory was almost as fast as RAMDirectory.  Personally, I would
  push FSDirectory and hope that the OS and the Filesystem do their share
  of work and caching for me before looking for ways to optimize my code.
  
  
  Yes... I performed the same benchmark and in my situation RAMDirectory
  for searches was about 2% slower.
 
  I'm willing to bet that it has to do with the fact that it's a Hashtable
  and not a HashMap (which isn't synchronized).
 
  Also adding a constructor for the term size could make loading a
  RAMDirectory faster since you could prevent rehash.
 
  If you're on a modern machine your filesystem cache will end up
  buffering your disk anyway, which I'm sure was happening in my situation.
 
  Kevin
 
  --
 
  Use Rojo (RSS/Atom aggregator).  Visit http://rojo.com. Ask me for an
  invite!  Also see irc.freenode.net #rojo if you want to chat.
 
  Rojo is Hiring! - http://www.rojonetworks.com/JobsAtRojo.html
 
  If you're interested in RSS, Weblogs, Social Networking, etc... then you
  should work for Rojo!  If you recommend someone and we hire them you'll
  get a free iPod!
 
  Kevin A. Burton, Location - San Francisco, CA
 AIM/YIM - sfburtonator,  Web - http://peerfear.org/
  GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 
 
 
 
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]
 
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: java.io.FileNotFoundException: ... (No such file or directory)

2004-11-19 Thread Justin Swanhart
Is it possible that while my searcher process is reading the directory,
the index writer process performs a merge?  If so, then I think
the merge could remove segment files before they are read by the
reader.  When the reader tries to read one of the now-missing segment
files it throws the IOException.  The file was listed in the segments file
when the RAMDirectory started loading the directory, but now it is
missing because of the merge.  This would most likely not affect small
indexes, but large indexes like mine, especially over a network file
system, could definitely be affected.

If this is what is happening, a way around it would be to open all the
files listed in the segments file at the moment the segments file is read.
Valid file handles would then be maintained for all the files that need
to be read.  If the index writer process removes a segment, the file
handle should still be valid.  This might only work for local
filesystems though; I'm not sure whether NFS works that way or not.



On Thu, 18 Nov 2004 19:16:46 -0500, Will Allen [EMAIL PROTECTED] wrote:
 I have gotten this a few times.  I am also using a NFS mount, but have seen 
 it in cases where a mount wasn't involved.
 
 I cannot speak to why this is happening, but I have previously posted to this forum 
 a way of repairing your index by modifying the segments file.  Search 
 for "wallen".
 
 The other thing I have done is use code to copy the documents that can still be 
 read by a reader into a new index.  I suppose I should submit those tools to 
 open source!
 
 Anyway, this error will break the searcher, but the index can still be read 
 with an IndexReader.
 
 -Will
 
 Here is the source of a method that should get you started (logger is a log4j 
 object):
 
 public void transferDocuments()
     throws IOException
 {
     IndexReader reader = IndexReader.open(brokenDir);
     logger.debug(reader.numDocs() + "");
     IndexWriter writer = new IndexWriter(newIndexDir,
         PopIndexer.popAnalyzer(), true);
     writer.minMergeDocs = 50;
     writer.mergeFactor = 200;
     writer.setUseCompoundFile(true);
     int docCount = reader.numDocs();
     Date start = new Date();
     //docCount = Math.min(docCount, 500);
     for (int x = 0; x < docCount; x++)
     {
         try
         {
             if (!reader.isDeleted(x))
             {
                 Document doc = reader.document(x);
                 if (x % 1000 == 0)
                 {
                     logger.debug(doc.get("subject"));
                 }
                 //remove the new fields if they exist, and add new value
                 //TODO test not having this in
                 /*
                 for (Enumeration newFields = doc.fields();
                      newFields.hasMoreElements(); )
                 {
                     Field newField = (Field) newFields.nextElement();
                     doc.removeFields( newField.name() );
                     doc.add( newField );
                 }
                 */
                 doc.removeFields("counter");
                 doc.add(Field.Keyword("counter", counter));
                 // reinsert old document
                 writer.addDocument( doc );
             }
         }
         catch (IOException ioe)
         {
             logger.error("doc: " + x + " failed, " + ioe.getMessage());
         }
         catch (IndexOutOfBoundsException ioobe)
         {
             logger.error("INDEX OUT OF BOUNDS! " + ioobe.getMessage());
             ioobe.printStackTrace();
         }
     }
     reader.close();
     //logger.debug("done, about to optimize");
     //writer.optimize();
     writer.close();
     long time = ((new Date()).getTime() - start.getTime()) / 1000;
     logger.info("done optimizing: " + time + " seconds or " +
         (docCount / time) + " rec/sec");
 }
 
 -Original Message-
 From: Justin Swanhart [mailto:[EMAIL PROTECTED]
 Sent: Thursday, November 18, 2004 5:00 PM
 To: Lucene Users List
 Subject: java.io.FileNotFoundException: ... (No such file or directory)
 
 I have two index processes.  One is an index server, the other is a
 search server.  The processes run on different machines.
 
 The index server is a single threaded process that reads from the
 database and adds
 unindexed rows to the index as needed.  It sleeps for a couple minutes
 between each
 batch to allow newly added/updated rows to accumulate.
 
 The searcher process keeps an open cache of IndexSearcher objects and
 is multithreaded.
 It accepts connections on a tcp port, runs the query and stores the
 results in a database.
 After a set interval, the server checks to see if the index on disk is
 a newer version.  If it is,
 it loads the index into a new IndexSearcher as a RAMDirectory.
 
 Every once in awhile, the index reader process gets a FileNotFoundException

java.io.FileNotFoundException: ... (No such file or directory)

2004-11-18 Thread Justin Swanhart
I have two index processes.  One is an index server, the other is a
search server.  The processes run on different machines.

The index server is a single threaded process that reads from the
database and adds
unindexed rows to the index as needed.  It sleeps for a couple minutes
between each
batch to allow newly added/updated rows to accumulate.

The searcher process keeps an open cache of IndexSearcher objects and
is multithreaded.
It accepts connections on a tcp port, runs the query and stores the
results in a database.
After a set interval, the server checks to see if the index on disk is
a newer version.  If it is,
it loads the index into a new IndexSearcher as a RAMDirectory.

Every once in awhile, the index reader process gets a FileNotFoundException:
20041118 1378 1383  (index number, old version, new version)
[newer version found] Loading index directory into RAM: 20041118
java.io.FileNotFoundException:
/path/omitted/for/obvious/reasons/_4zj6.cfs (No such file or
directory)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.init(RandomAccessFile.java:204)
at 
org.en.lucene.store.FSInputStream$Descriptor.init(FSDirectory.java:376)
at org.en.lucene.store.FSInputStream.init(FSDirectory.java:405)
at org.en.lucene.store.FSDirectory.openFile(FSDirectory.java:268)
at org.en.lucene.store.RAMDirectory.init(RAMDirectory.java:60)
at org.en.lucene.store.RAMDirectory.init(RAMDirectory.java:89)
at 
org.en.global.searchserver.UpdateSearchers.createIndexSearchers(Search.java:89)
at org.en.global.searchserver.UpdateSearchers.run(Search.java:54)

the code being called at that point is:
//add the directory to the HashMap of IndexSearchers (dir# = IndexSearcher)
indexSearchers.put(subDirs[i], new IndexSearcher(new
    RAMDirectory(indexDir + "/" + subDirs[i])));

The indexes are located on an NFS mountpoint.  Could this be the
problem?  Or should I be looking elsewhere...  Should I just check for
an IOException, and try reloading the index if I get an error?

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Index copy

2004-11-17 Thread Justin Swanhart
You could lock your index for writes, then copy the files using
operating system copy commands.

Another way would be to lock your index, make a filesystem snapshot,
then unlock your index.  You can then safely copy the snapshot without
interrupting further index operations.

On Wed, 17 Nov 2004 11:25:48 -0500, Ravi [EMAIL PROTECTED] wrote:
 What's the best way to copy an index from one directory to another? I
 tried opening an IndexWriter at the new location and using addIndexes to
 read from the old index, but that was very slow.
 
 Thanks in advance,
 Ravi.
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Something missing !!???

2004-11-17 Thread Justin Swanhart
The HEAD version in CVS supports gz compression.  You will need to
check it out using CVS if you want to use it.


On Wed, 17 Nov 2004 21:43:36 +0200, abdulrahman galal [EMAIL PROTECTED] wrote:
 I noticed recently that a lot of people discuss the bugs of Lucene with
 each other ...
 
 but something is missing ... I consider Lucene an indexing tool for text
 files and so on ...
 
 but there are a lot of tools that do this kind of indexing, like access ...
 
 what about compression ... compressing the original text files and their indexes
 and performing indexing on them, like the (MG) system, which is efficient at
 compression and indexing ...
 
 where is all of that in Lucene?  Please help me.
 
 if these requirements are satisfied in Lucene, please notify me and send a
 link to the new version...
 
 thanks a lot ...
 
 _
 Express yourself instantly with MSN Messenger! Download today it's FREE!
 http://messenger.msn.click-url.com/go/onm00200471ave/direct/01/
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: version documents

2004-11-17 Thread Justin Swanhart
Split the filename into a base filename and a version, and make each a keyword.

Sort your query by version descending, and only use the first
basefile you encounter.
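A rough sketch of both halves, assuming the files really are named
default_N.xml (the field and variable names here are made up):

// index time: split "default_3.xml" into a base key and a version number
String name = f.getName();                                // "default_3.xml" or "default.xml"
String stem = name.substring(0, name.lastIndexOf('.'));   // "default_3" / "default"
int us = stem.indexOf('_');
String base = (us < 0) ? stem : stem.substring(0, us);    // "default"
String ver  = (us < 0) ? "0"  : stem.substring(us + 1);   // "3" / "0"
doc.add(Field.Keyword("basefile", f.getParent() + "/" + base));
doc.add(Field.Keyword("version", ver));

// search time: the newest version of each basefile comes back first (Lucene 1.4 sort API)
Hits hits = searcher.search(query,
    new Sort(new SortField("version", SortField.INT, true)));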

On Wed, 17 Nov 2004 15:05:19 -0500, Luke Shannon
[EMAIL PROTECTED] wrote:
 Hey all;
 
 I have run into an interesting case.
 
 Our system has notes. These need to be indexed. They are xml files called 
 default.xml and are easily parsed and indexed. No problem, have been doing it 
 all week.
 
 The problem is if someone edits the note, the system doesn't update the 
 default.xml. It creates a new file, default_1.xml (every edit creates a new 
 file with an incremented number; the system only displays the content from the 
 highest number).
 
 My problem is that I index all the documents and end up with terms that were taken 
 out of the note several versions ago still showing up in queries. From my point 
 of view this makes sense, because the old files are still in the content. But to a 
 user it is confusing, because they have no idea that every change they make to a 
 note spawns a new file, and now they are seeing a term they removed from their 
 note 2 weeks ago showing up in a query.
 
 I have started modifying my incremental update to look for multiple 
 versions of the default.xml, but it is more work than I thought and is going to 
 make things complex.
 
 Maybe there is an easier way? If I just let it run and create the index, can 
 somebody suggest a way I could easily scan the index folder ensuring only the 
 default.xml with the highest number in its filename remains (only for folders 
 where there is more than one default.xml file)? Or is this wishful thinking?
 
 Thanks,
 
 Luke


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: QueryParser: [stopword] AND something throws Exception

2004-11-12 Thread Justin Swanhart
Try using 1.4.2.  The CHANGES file says that
ArrayIndexOutOfBoundsExceptions have been fixed in the QueryParser.


On Fri, 12 Nov 2004 12:04:31 -0500, Will Allen [EMAIL PROTECTED] wrote:
 Holy cow!  This does happen!
 
 
 
 -Original Message-
 From: Peter Pimley [mailto:[EMAIL PROTECTED]
 Sent: Friday, November 12, 2004 11:52 AM
 To: Lucene Users List
 Subject: QueryParser: [stopword] AND something throws Exception
 
 [this is using lucene-1.4-final]
 
 Hello.
 
 I have just encountered a way to get the QueryParser to throw an
 ArrayIndexOutOfBoundsException.  It can be recreated with the demo
 org.apache.lucene.demo.SearchFiles program.  The way to trigger it is to
 parse a query of the form:
 
 a AND b
 
 ...where 'a' is a stop word.  For example, "the AND vector".  It only
 happens when the -first- term is a stop word.  You could search for
 "vector AND the" or "vector AND the AND class", and it works as you
 would expect (i.e. the stop words are ignored).
 
 Unfortunately I am up against a deadline right now so I can't fix this
 myself.  I'm just going to filter out stop words before feeding them to
 the query parser.  I'll try to have a look at it in roughly 2 weeks time
 if nobody else has solved it.
 
 Peter Pimley,
 Semantico
 
 Here is the stack trace.
 
 java.lang.ArrayIndexOutOfBoundsException: -1
 at java.util.Vector.elementAt(Vector.java:434)
 at
 org.apache.lucene.queryParser.QueryParser.addClause(QueryParser.java:181)
 at
 org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:529)
 at
 org.apache.lucene.queryParser.QueryParser.Clause(QueryParser.java:561)
 at
 org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:500)
 at
 org.apache.lucene.queryParser.QueryParser.Clause(QueryParser.java:561)
 at
 org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:500)
 at
 org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:108)
 at
 org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:87)
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Searching in keyword field ?

2004-11-09 Thread Justin Swanhart
You can add the category keyword multiple times to a document.

Instead of separating your categories with a delimiter, just add the
keyword multiple times:

doc.add(Field.Keyword("category", "ABC"));
doc.add(Field.Keyword("category", "DEF GHI"));

On Tue, 9 Nov 2004 17:18:19 +0100, Thierry Ferrero (Itldev.info)
[EMAIL PROTECTED] wrote:
 Hi All,
 
 Can I search for only one word in a keyword field which contains a few words?
 I know a keyword field isn't tokenized. After many tests, I think it is
 impossible.
 Can someone confirm this?
 
 Why don't I use a text field? Because the users know the category from a
 list (ex: category ABC, category DEF GHI, category JKL ...) and the keyword
 field 'category' can contain several terms (ABC, DEF GHI, OPQ RST).
 I use a SnowballAnalyzer for text fields when indexing.
 Perhaps the better way for me is to use a text field with the value ABC
 DEF_GHI JKL_NOPQ, where categories are concatenated with a _.
 Thanks for your reply!
 
 Thierry.
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Windows Bug?

2004-11-08 Thread Justin Swanhart
The reason this is failing is that you are trying to create a new
index in the directory.  It works on *nix file systems because you can
delete an open file on those operating systems, something you can't do
under Windows.

If you change the create parameter to false on your second call,
everything should work as you expect it to.


On 8 Nov 2004 18:27:12 -, [EMAIL PROTECTED]
[EMAIL PROTECTED] wrote:
 Hi,
 
 My understanding is that I can have an IndexReader open for searching
 (as long as it doesn't delete) while an IndexWriter is updating the index.
 
 I wrote a simple test app to prove this and it works great on Mac OS
 X, Java 1.4.2 and Lucene 1.4.2.  It fails on Windows XP, Java 1.4.2 and Lucene
 1.4.2.  I tried other versions of Lucene and it failed in those too.
 
 This is the app that fails on Windows:
 
 public static void main(String[] args)
     throws Exception {
 
     String indexFolder = "/TestIndex";
 
     // add a document to the index
     IndexWriter indexWriter = new IndexWriter(indexFolder,
         new StandardAnalyzer(), true);
     Document document = new Document();
     Field field = new Field("foo", "bar", true, true, true);
     document.add(field);
     indexWriter.addDocument(document);
     indexWriter.close();
 
     // open an index reader but don't close it
     IndexReader indexReader = IndexReader.open(indexFolder);
 
     // open an index writer
     indexWriter = new IndexWriter(indexFolder,
         new StandardAnalyzer(), true);
     indexWriter.close();
 }
 
 On Windows XP this throws an Exception as soon as it tries to open the
 IndexWriter after the IndexReader has been opened.
 
 Here's the stack trace:
 
 Exception in thread "main" java.io.IOException: Cannot delete _1.cfs
 
   at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:144)
 
   at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:105)
 
   at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:83)
 
   at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:173)
 
 at scratch.TestLuceneLocks.main(TestLuceneLocks.java:17)
 
 Is this a bug?
 
 Thanks.
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: IndexSearch

2004-11-08 Thread Justin Swanhart
You can write to the index and read from it at the same time. You can
only have one IndexWriter open at any one time.

IndexSearchers will only see documents that were created before they
were instantiated, so you need to create new ones periodically to see
new documents.


On Mon, 8 Nov 2004 14:26:40 -0800, Ramon Aseniero
[EMAIL PROTECTED] wrote:
 Hi All,
 
 Can IndexSearcher be persisted? Are there any limitations on index updates
 while searches are in progress? Any file locking issues?
 
 Thanks,
 
 Ramon
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Is there an easy way to have indexing ignore a CVS subdirectory in the index directory?

2004-11-05 Thread Justin Swanhart
You should exclude your lucene index from the CVS repository.  This is
the same thing you would do if you had a process that generated files
in your source tree from other files.  The generated files wouldn't
have any meaning in the repository, and can be regenerated at any
time, so you would want to exclude them.

You should be able to do this in your CVS modules file.  Check the CVS
manual for details, but I think you can just add !/path/to/exclude to
the list of paths in the module file.

for example:
modulename -a !/exclude/this/path /include/this/path


On Fri, 5 Nov 2004 09:03:00 -0800, Chuck Williams [EMAIL PROTECTED] wrote:
 Sergiu,
 
 The Lucene index is not in CVS -- neither the directory nor the files.
 But it is a subdirectory of a directory that is in CVS, and it needs to
 be structured that way due to the directory structure constraints of
 Tomcat and the way Netbeans automates Tomcat app development and
 deployment (which uses a development directory layout that directly
 parallels the Tomcat runtime layout).  I want to be able to Update the
 entire repository to make sure I've got all of the latest changes, which
 means doing CVS Update on an ancestor directory of the Lucene index
 directory.  Even though the index directory is not in CVS, doing the
 update on the ancestor directory consistently causes CVS to insert a CVS
 subdirectory into the index directory, causing the problem.  Both WinCVS
 and the Netbeans CVS client have this same behavior.  I have not been
 able to find any option to stop this -- do you know of one?
 
 Also, I can't just move the CVS directory out of the index directory,
 unless I'm very careful to move it back before every CVS Update.  For
 similar reasons I can't just delete it either.  CVS (and Netbeans) get
 very upset if there are pointers to this directory but it isn't there.
 The pointer exists in the CVS Entries file (and another for Netbeans in
 a cache file) in the CVS subdirectory of the parent directory of the
 index directory.  So, I have to manually eliminate those if I want to
 delete the index directory's CVS directory.  And then they come back
 after the next update!  All in all very frustrating.
 
 I'm going to try the code patch that Otis suggested.  If anybody knows
 some way in CVS to avoid this problem, I'd love to hear about it.
 
 Thanks,
 
 Chuck
 
 
 
-Original Message-
From: sergiu gordea [mailto:[EMAIL PROTECTED]
Sent: Friday, November 05, 2004 1:43 AM
To: Lucene Users List
Subject: Re: Is there an easy way to have indexing ignore a CVS
subdirectory in the index directory?
   
Chuck Williams wrote:
   
Otis, thanks for looking at this.  The stack trace of the exception
 is
below.  I looked at the code.  It wants to delete every file in the
index directory, but fails to delete the CVS subdirectory entry
(presumably because it is marked read-only; the specific exception
 is
swallowed).  Even if it could delete the CVS subdirectory, this
 would
just cause another problem with Netbeans/CVS, since it wouldn't
 know
how
to fix up the pointers in the parent CVS subdirectory.  Is there a
change I could make that would cause it to safely leave this alone?


Why do you have the lucene index in CVS? From what I know the lucene
index folder shouldn't contain any other folder,
just the lucene files.  I think it won't be any problem to delete
 CVS
folder from lucene index and to remove the index from CVS.
If you are afraid to do that ... you can move the CVS subfolder from
lucene index into another folder ... and restore if you have any
problems. I'm sure you will have no problem ... but this is just for
your trust...
   
 Sergiu
   
This problem only arises on a full index (incremental == false =
create == true).  Incremental indexes work fine in my app.

Chuck

java.io.IOException: Cannot delete CVS
at
 org.apache.lucene.store.FSDirectory.create(FSDirectory.java:144)
at
 org.apache.lucene.store.FSDirectory.init(FSDirectory.java:128)
at
   
 org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:102)
at
   
 org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:83)
at
 org.apache.lucene.index.IndexWriter.init(IndexWriter.java:173)
at [my app]...

   -Original Message-
   From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
   Sent: Thursday, November 04, 2004 1:54 PM
   To: Lucene Users List
   Subject: Re: Is there an easy way to have indexing ignore a CVS
   subdirectory in the index directory?
  
   Hm, as far as I know, a CVS sub-directory in an index directory
should
   not bother Lucene.  As a matter of fact, I tested this (I used
 a
file,
   not a directory) for Lucene in Action.  What error are you
 getting?
  
   I know there is -I CVS option for ignoring files; perhaps it
 

Re: one huge index or many small ones?

2004-11-04 Thread Justin Swanhart
First off, I think you should make a decision about what you want to
store in your index and how you go about searching it.

The less information you store in your index, the better, for
performance reasons.  If you can store the messages in an external
database you probably should.  I would create a table that contains a
clob and an associated id that can be used to get the message at any
time.

Assuming mail is in SMTP RFC format:

I would suggest:
Unstored: Subject
Keyword: From
Keyword: To
Stored,Unindexed: ID  -- this would be the ID to the message in your database
Unstored: Body 
Keyword: Month
Keyword: Day
Keyword: Year
(and any other keywords you might use)

Your lucene query would then look something like:
+From:[EMAIL PROTECTED] +(Subject:money Body:money) +Year:2004

Use the stored ID field to get the message contents from your database.

If you want to break your index down into multiple indexes based on
some criterion such as time frame, you could do that too.  You would
then use a MultiSearcher or ParallelMultiSearcher to process the
multiple indexes.
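A sketch of a document built along those lines (the values come from wherever
you parse the message, and the field names simply follow the list above):

Document doc = new Document();
doc.add(Field.UnStored("Subject", subject));
doc.add(Field.Keyword("From", fromAddress));
doc.add(Field.Keyword("To", toAddress));
doc.add(Field.UnIndexed("ID", databaseId));   // used later to fetch the clob
doc.add(Field.UnStored("Body", bodyText));
doc.add(Field.Keyword("Month", month));
doc.add(Field.Keyword("Day", day));
doc.add(Field.Keyword("Year", year));
writer.addDocument(doc);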


On Thu, 4 Nov 2004 18:03:49 +0100, javier muguruza [EMAIL PROTECTED] wrote:
 Thanks Erik and Giulio for the fast reply.
 
 I am just starting to look at lucene so forgive me if I got some ideas
 wrong. I understand your concerns about one index per email. But
 having only one index is also (I guess) out of the question.
 
 I am building an email archive. Email will be kept indefinitely
 available for search, adding new email every day. Imagine a company
 with millions of emails per day (been there), keep it growing for
 years, adding stuff to the index while using it for searches
 continuously...
 
 That's why my idea is to decide on a time frame (a day, a month...an
 extreme would be an instant, that is a single email, my original idea)
 and build the index for all the email in that timeframe. After the
 timeframe is finished, no more stuff will ever be added.
 
 Before the lucene search, emails are selected based on other conditions
 (we store the from, to, date etc. in the database as well, and these
 conditions are enforced with a sql query first, so I would not need to
 enforce them in the lucene search again; also, that query can be quite
 sophisticated and I guess would not be easily possible to do in
 lucene by itself). That first db step gives me a group of emails that
 maybe I have to further narrow down based on a lucene search (of body
 and attachment contents). Having an index for more than one email
 means that after the search I would have to get only the overlapping
 emails from the two searches... Maybe this is better than keeping the
 same info I have in the db in lucene fields as well.
 
 An example: I want all the email from [EMAIL PROTECTED] from Jan
 to Dec containing the word 'money'. I run the db query that returns a
 list with john's email for that period of time, then (let's assume I
 have one index per day) I iterate over every day, looking for emails
 that contain 'money'; from the results returned by lucene I keep only
 those that are also in the first list.
 
 Does that sound better?
 
 On Thu, 4 Nov 2004 17:26:21 +0100, Giulio Cesare Solaroli
 
 
 [EMAIL PROTECTED] wrote:
  Hi Javier,
 
  I suggest you to build a single index, with all the information you
  need to find the right mail you are looking for. You than can use
  Lucene alone to find you messages.
 
  Giulio Cesare
 
 
 
 
  On Thu, 4 Nov 2004 17:00:35 +0100, javier muguruza [EMAIL PROTECTED] wrote:
   Hi,
  
   We are going to move from a just-in-time perl based search to using
   lucene in our project. I have to index emails (bodies and also
    attachments). I keep in the filesystem all the bodies and attachments
    for a long period of time. I have to find emails that fulfil certain
    conditions; some of the conditions are taken care of at a different
   level, so in the end I have a SUBSET of emails I have to run through
   lucene.
  
    I was assuming that the best way would be to create an index for each
    email. Having a unique index for a group of emails (say a day's worth
    of email) seems too coarse grained; imagine a day has 1 emails,
    and some queries will want to look in only a handful of the
    emails... But the problem with having one index per email is the
    massive number of emails... imagine having 10 indexes
  
   Anyway, any idea about that? I just wanted to check wether someones
   feels I am wrong.
  
   Thanks
  
   -
 
 
   To unsubscribe, e-mail: [EMAIL PROTECTED]
   For additional commands, e-mail: [EMAIL PROTECTED]
  
  
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



prefix wildcard matching options (*blah)

2004-11-04 Thread Justin Swanhart
I'm thinking about making a separate field in my index for prefix
wildcard searches.
I would chop off x characters from the front to create subtokens for
the prefix matches.

For the term: republican
terms created: republican epublican publican ublican blican

My query parser would then intelligently decide if there is a term
that has a wildcard as the first character of the term.  Instead of
searching the normal field, it would then remove the wildcard from the
start of the term and search on the prefix field instead.

A search for *pub* would be converted to pub* in the prefix field.  
A search for *blican would be converted to blican

Does this sound like an intelligent way to create fast prefix querying ability?

Can I index the prefix field with a separate analyzer that makes the
prefix tokens, or should I just do the index-time expansion manually? 
I wouldn't need to search with this analyzer, just index with it,
because the searching doesn't have to expand all those terms.

If using a separate analyzer for the prefix field makes more sense, how
do I make a tokenizer that returns multiple tokens for one word?
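One possible shape for such a filter, as an untested sketch: it emits the
original token, then hands out the queued suffix tokens on later calls to
next().

import java.io.IOException;
import java.util.LinkedList;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class SuffixTokenFilter extends TokenFilter {
    private final LinkedList pending = new LinkedList();

    public SuffixTokenFilter(TokenStream in) {
        super(in);
    }

    public Token next() throws IOException {
        if (!pending.isEmpty()) {
            return (Token) pending.removeFirst();
        }
        Token t = input.next();
        if (t == null) {
            return null;
        }
        String text = t.termText();
        // queue every proper suffix: "republican" -> "epublican", "publican", ...
        for (int i = 1; i < text.length(); i++) {
            pending.add(new Token(text.substring(i),
                                  t.startOffset() + i, t.endOffset()));
        }
        return t;
    }
}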

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Search speed

2004-11-02 Thread Justin Swanhart
If you know all the phrases you are going to search for, you could
modify an analyzer to make those phrases into whole terms when you are
analyzing.

Other than that, you can test the speed of breaking the phrase query
up into term queries.  You would have to do an AND on all the words in
the phrase, get the documents that match all the terms, and then do a
substring search for your exact phrase.  Any documents that still match
are the ones you return.

search: death AND notice
for each hit
    if contents contains "death notice"
        add hit to final result list
loop
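A rough Java sketch of that two-step filter; it assumes the "contents" field
is stored (or can be fetched from somewhere else), and the field name is only
an example:

BooleanQuery q = new BooleanQuery();
q.add(new TermQuery(new Term("contents", "death")), true, false);   // required
q.add(new TermQuery(new Term("contents", "notice")), true, false);  // required

Hits hits = searcher.search(q);
List finalResults = new ArrayList();
for (int i = 0; i < hits.length(); i++) {
    Document d = hits.doc(i);
    String text = d.get("contents");
    if (text != null && text.indexOf("death notice") != -1) {
        finalResults.add(d);
    }
}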

On Tue, 2 Nov 2004 18:07:26 +0100, Paul Elschot [EMAIL PROTECTED] wrote:
 On Tuesday 02 November 2004 17:50, Jeff Munson wrote:
  Thanks for the info Paul.  The requirements of my search engine are that
  I need to search for phrases like death notice or world war ii.  You
  suggested that I break the phrases into words.  Is there a way to break
  the phrases into words, do the search, and just return the documents
  with the phrase?  I'm just looking for a way to speed up the phrase
  searches.
 
 If you know the phrases in advance, ie. before indexing, you can index
 and search them as terms with a special purpose analyzer.
 It's an unusual solution, though.
 
 
 
 Regards,
 Paul Elschot
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



When do document ids change

2004-10-29 Thread Justin Swanhart
Given an FSDirectory based index A.
Documents are added to A with an IndexWriter
  minMergeDocs = 2
  mergeFactor = 3

Documents are never deleted.

Once the RAMDirectory merges documents to the index:

a) will the documentID values for index A ever change?
b) can a mapping between a term in the document and newly created
documentID be made?

Why I am asking this question:
I have a database with about 10M rows in it.  My search engine needs
to be able to quickly
get all the rows back from the database that match a query.  All the
rows need to be
returned at once, because the entire result set is sorted based on user input.  

What I want to do:
When a documentID gets assigned to a document, I want to update the
database row
that matches the document's id field with the lucene documentID.  That
way, I can use a
hitcollector to gather just the documentID values from the search and
insert them into a
temporary cache table, then grab the matching rows from the database. 
This will work assuming the documentID values for the given document
never change.
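For reference, a minimal sketch of collecting only the internal document IDs
(no stored-field access at all), assuming the usual Lucene 1.4 imports:

final List ids = new ArrayList();
searcher.search(query, new HitCollector() {
    public void collect(int doc, float score) {
        ids.add(new Integer(doc));   // just the documentID, no hits.doc() call
    }
});
// ids can now be bulk-inserted into the temporary cache table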

Currently, running an IndexSearcher.search() and getting all the rows
back takes between
5 and 30 seconds for most queries, which is certainly not fast enough.
 The time it takes to collect the documentIDs however is less than 1
second.  All the time is taken by calling
hits.doc() for each document to get the id field to insert into the database. 

So finally,  will what I want to do work, and if so, how can I go
about updating the database when the documentID is created?

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Searching for a phrase that contains quote character

2004-10-28 Thread Justin Swanhart
Have you tried making a term query by hand and testing to see if it works?  

Term t = new Term("field", "this is a \"test\"");
PhraseQuery pq = new PhraseQuery();
pq.add(t);
...



On Thu, 28 Oct 2004 12:02:48 -0400, Will Allen [EMAIL PROTECTED] wrote:
 
 I am having this same problem, but cannot find any help!
 
 I have a keyword field that sometimes includes double quotes, but I am unable to 
 search for that field because the escape for a quote doesn't work!
 
 I have tried a number of things:
 
 myfield:lucene is \cool\
 
 AND
 
 myfield:lucene is \\cool\\
 
 http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]msgNo=7351
 
 From: [EMAIL PROTECTED] [EMAIL PROTECTED]
 Subject: Searching for a phrase that contains quote character
 Date: Wed, 24 Mar 2004 21:25:16 +
 
 I'd like to search for a phrase that contains the quote character. I've tried
 escaping the quote character, but am receiving a ParseException from the
 QueryParser:
 
 For example to search for the phrase:
 
  this is a test
 
 I'm trying the following
 
  QueryParser.parse(field:\This is a \\\test, field, new 
 StandardAnalyzer());
 
 This results in:
 
 org.apache.lucene.queryParser.ParseException: Lexical error at line 1, column 31.  
 Encountered: EOF after : 
 at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:111)
 at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:87)
 ...
 
 What is the proper way to accomplish this?
 
 --Dan
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Searching for a phrase that contains quote character

2004-10-28 Thread Justin Swanhart
absolutely correct.  sorry about that.  shouldn't code before coffee :)


On Thu, 28 Oct 2004 20:16:16 +0200, Daniel Naber
[EMAIL PROTECTED] wrote:
 On Thursday 28 October 2004 19:03, Justin Swanhart wrote:
 
  Have you tried making a term query by hand and testing to see if it
  works?  
 
  Term t = new Term(field, this is a \test\);
  PhraseQuery pq = new PhraseQuery(t);
 
 That's not a proper PhraseQuery, it searches for *one*
 term "this is a test" which is probably not what one wants. You
 have to add the terms one by one to a PhraseQuery.
 
 Regards
  Daniel
 
 --
 http://www.danielnaber.de
 
 -
 
 
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: IndexWriter Constructor question

2004-10-27 Thread Justin Swanhart
You could always modify your own local copy if you want to change the
behavior of the parameter.

or just do:
IndexWriter w = new IndexWriter(indexDirectory,
                                new StandardAnalyzer(),
                                !IndexReader.indexExists(indexDirectory));

If you do that, then if an index already exists it will be appended to;
otherwise a new one will be created...

On Wed, 27 Oct 2004 12:26:29 -0500, Armbrust, Daniel C.
[EMAIL PROTECTED] wrote:
 Wouldn't it make more sense if the constructor for the IndexWriter always created an 
 index if it doesn't exist - and the boolean parameter were "clear" (instead of 
 "create")?
 
 So instead of this (from javadoc):
 
 IndexWriter
 
 public IndexWriter(Directory d,
Analyzer a,
boolean create)
 throws IOException
 
 Constructs an IndexWriter for the index in d. Text will be analyzed with a. If 
 create is true, then a new, empty index will be created in d, replacing the index 
 already there, if any.
 
 Parameters:
 d - the index directory
 a - the analyzer to use
 create - true to create the index or overwrite the existing one; false to append 
 to the existing index
 Throws:
 IOException - if the directory cannot be read/written to, or if it does not 
 exist, and create is false
 
 We would have this:
 
 IndexWriter
 
 public IndexWriter(Directory d,
Analyzer a,
boolean clear)
 throws IOException
 
 Constructs an IndexWriter for the index in d. Text will be analyzed with a. If 
 clear is true, and a index exists at location d, then it will be erased, and a new, 
 empty index will be created in d.
 
 Parameters:
 d - the index directory
 a - the analyzer to use
 clear - true to overwrite the existing one; false to append to the existing index
 Throws:
 IOException - if the directory cannot be read/written to, or if it does not 
 exist.
 
 Its current behavior is kind of annoying, because I have an app that should never 
 clear an existing index, it should always append.  So I want create set to false.  
 But when I am starting a brand new index, then I have to change the create flag to 
 keep it from throwing an exception...  I guess for now I will have to write code to 
 check if an index actually has content yet, and if it doesn't, change the flag on the 
 fly.
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Stopwords in Exact phrase

2004-10-27 Thread Justin Swanhart
Your analyzer will have removed the stopword when you indexed your documents, so
Lucene won't be able to do this for you.

You will need to implement a second pass over the results returned by Lucene and
check whether the full phrase (including the stopword) is present, perhaps with String.indexOf().


On Wed, 27 Oct 2004 14:36:14 -0500, Ravi [EMAIL PROTECTED] wrote:
  Is there a way to include stopwords in an exact phrase search? For
 example, when I search on "Melbourne IT", Lucene only searches for
 "Melbourne", ignoring "IT".
 
 Thanks,
 Ravi.
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Multi + Parallel

2004-10-14 Thread Justin Swanhart
The overhead of creating that many searcher objects is going to far
outweigh any performance benefit you could possibly hope to gain by
splitting your index up.


On Thu, 14 Oct 2004 04:42:27 -0700 (PDT), Otis Gospodnetic
[EMAIL PROTECTED] wrote:
 Search a single merged index.
 
 Otis
 
 
 
 --- Karthik N S [EMAIL PROTECTED] wrote:
 
  Hi
 
  Apologies..
 
 
Can somebody provide me Approximate answers   [ Which is Better
  choice ]
 
A search of  10,000 subindexes using  multisearcher
 
or
 
   a search on  One Single Merged Index [ merged 10,000 Sub indexes ]
 
 
  a) SubIndexes  10,000 (   future)
 
  b) Fields to be searched upon = 4
 
  c)Field type present in Indexed format = 15
 
  d)  RAM = 1GB
 
   e) O/S Linux [ Clustered Environment ]
 
   f) Processor make AMD [Probably High End]
 
   g) WebServer Tomcat 5.0.x
 
 
 
 
1)Which would be Faster ???;
 
2)If not What is may be the Probable Solution.
 
 
  Karthik
 
 
 
 
  -Original Message-
  From: Erik Hatcher [mailto:[EMAIL PROTECTED]
  Sent: Wednesday, October 13, 2004 3:53 PM
  To: Lucene Users List
  Subject: Re: Multi + Parallel
 
 
  On Oct 13, 2004, at 3:14 AM, Karthik N S wrote:
   I was Curious to Know the Difference between ParallelMultiSearcher
  and
   MultiSearcher ,
  
   1) Is the working internal functionality of these  are  same or
   different .
 
  They are different internally.  Externally they should return
  identical
  results and not appear different at all.
 
  Internally, ParallelMultiSearcher searches each index in a separate
  thread (searches wait until all threads finish before returning).
  In
  MultiSearcher, each index is searched serially.
 
  You will not likely see a benefit to using ParallelMultiSearcher
  unless
  your environment is specialized to accommodate multi-threading
  (multiple CPU's, indexes on separate drives that can operate
  independently, etc).
 
   2) In terms of time domain do these differ when searching same no
  of
   fields
   / words .
  
   3)What are the features used on each of  API.
 
  There is no external difference to using either implementation.
  Benchmark searches using both and see what is best, but generally
  MultiSearcher will be better in most environments as it avoids the
  overhead of starting up and managing multiple threads.
 
Erik
 
 
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]
 
 
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]
 
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Indexing Strategy for 20 million documents

2004-10-08 Thread Justin Swanhart
It depends on a lot of factors.  I myself use multiple indexes for
about 10M documents.
My documents are transient.  Each day I get about 400K and I remove
about 400K.  I always remove an entire day's documents at one time.  It
is much faster/easier to delete the lucene index for the day that I am
removing than to loop through one big index and remove the entries with
an IndexReader.  Since my data is also partitioned by day in my
database, I essentially do the same thing there with truncate table.

I use a ParallelMultiSearcher object to search the indexes.  I store
my indexes on a 14
disk 15k rpm  fibre channel RAID 1+0 array (striped mirrors).

I get very good performance in both updating and searching indexes.
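The search side is roughly this shape (the paths and the days array are
placeholders; assumes the usual Lucene 1.4 imports):

// one IndexSearcher per daily index, all searched in parallel
Searchable[] searchables = new Searchable[days.length];
for (int i = 0; i < days.length; i++) {
    searchables[i] = new IndexSearcher("/indexes/" + days[i]);
}
Searcher searcher = new ParallelMultiSearcher(searchables);
Hits hits = searcher.search(query);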

On Fri, 8 Oct 2004 06:11:37 -0700 (PDT), Otis Gospodnetic
[EMAIL PROTECTED] wrote:
 Jeff,
 
 These questions are difficult to answer, because the answer depends on
 a number of factors, such as:
 - hardware (memory, disk speed, number of disks...)
 - index complexity and size (number of fields and their size)
 - number of queries/second
 - complexity of queries
 etc.
 
 I would try putting everything in a single index first, and split it up
 only if I see performance issues.  Going from 1 index to N indices is
 not a lot of work (not a lot of Lucene-related code).  If searching 1
 big index is too slow, split your index, put each index on a separate
 disk, and use ParallelMultiSearcher
 (http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/ParallelMultiSearcher.html)
 to search your indices.
 
 Otis
 
 
 
 
 --- Jeff Munson [EMAIL PROTECTED] wrote:
 
  I am a new user of Lucene.  I am looking to index over 20 million
  documents (and a lot more someday) and am looking for ideas on the
  best
  indexing/search strategy.
 
  Which will optimize the Lucene search, one index or multiple indexes?
  Do I create multiple indexes and merge them all together?  Or do I
  create multiple indexes and search on the multiple indexes?
 
  Any helpful ideas would be appreciated!
 
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]
 
 
 
 
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Analyzer reuse

2004-10-07 Thread Justin Swanhart
Yes, you can reuse analyzers.  The only performance gain will come from
not having to create the objects and not having garbage collection
overhead.  I create one for each of my index reading threads.

On Thu, 07 Oct 2004 16:59:38 +, sam s [EMAIL PROTECTED] wrote:
 Hi,
 Can an instance of an analyzer be reused?
 If yes, will it give any performance gain?
 
 sam
 
 _
 Add photos to your messages with MSN 8. Get 2 months FREE*.
 http://join.msn.com/?page=features/featuredemail
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



multiple threads

2004-10-01 Thread Justin Swanhart
As I understand it, if two writers try to access the same index for
writing, then one of the writers should block waiting for a lock until
the lock timeout period expires, and then it will return a lock
wait timeout exception.

I have a multithreaded indexing application that writes into one of
multiple indexes depending on a hash value, and I intend to merge all
of those indexes when the indexing finishes.  Locking usually works, but
sometimes it doesn't and I get IO exceptions such as the following:

java.io.IOException: Cannot delete _19.fnm
at org.apache.lucene.store.FSDirectory.deleteFile(FSDirectory.java:198)
at 
org.apache.lucene.index.SegmentMerger.createCompoundFile(SegmentMerger.java:157)
at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:100)
at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:487)
at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:366)
at org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:389)
at org.en.global.indexer.IndexGroup.run(IndexGroup.java:387)


Any idea on why this could be happening?  I am using NFS currently,
but the problem appears on the local filesystem as well.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]