Re: URGENT: Help indexing large document set

2004-11-24 Thread Paul Elschot
On Wednesday 24 November 2004 00:37, John Wang wrote:
 Hi:
 
I am trying to index 1M documents, with batches of 500 documents.
 
Each document has a unique text key, which is added as a
 Field.Keyword(name, value).
 
For each batch of 500, I need to make sure I am not adding a
 document with a key that is already in the current index.
 
   To do this, I am calling IndexSearcher.docFreq for each document and
 deleting the document currently in the index with the same key:
  
while (keyIter.hasNext()) {
 String objectID = (String) keyIter.next();
 term = new Term(key, objectID);
 int count = localSearcher.docFreq(term);

To speed this up a bit make sure that the iterator gives
the terms in sorted order. I'd use an index reader instead
of a searcher, but that will probably not make a difference.

Adding the documents can be done with multiple threads.
Last time I checked that, there was a moderate speed up
using three threads instead of one on a single CPU machine.
Tuning the values of minMergeDocs and maxMergeDocs
may also help to increase performance of adding documents.
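For instance, a rough sketch only (the path, analyzer and values are made up; in 1.4 these are public fields on IndexWriter):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

// sketch only: tune segment merging for batch indexing
IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), false);
writer.mergeFactor = 50;      // merge segments less often while adding
writer.minMergeDocs = 1000;   // buffer more documents in memory before flushing a segment
// ... addDocument() calls, possibly from several threads sharing this writer ...
writer.optimize();
writer.close();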

Regards,
Paul Elschot





Re: lucene Scorers

2004-11-24 Thread Paul Elschot
On Wednesday 24 November 2004 01:31, Ken McCracken wrote:
 Hi,
 
 Thanks for the pointers in your replies.  Would it be possible to include
 some sort of accrual scorer interface somewhere in the Lucene Query
 APIs?  This could be passed into a query similar to
 MaxDisjunctionQuery; and combine the sum, max, tieBreaker, etc.,
 according to the implementor's discretion, to compute the overall
 score for a document.

The DisjunctionScorer is currently not part of Lucene.
You might try and subclass Similarity to provide what you need and
pass that to your Query.
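A rough sketch of that direction (the overridden factors are arbitrary examples, and here the Similarity is set on the searcher rather than passed to the query):

import org.apache.lucene.search.DefaultSimilarity;
import org.apache.lucene.search.IndexSearcher;

// sketch: ignore term frequency and the coordination bonus
public class FlatSimilarity extends DefaultSimilarity {
    public float tf(float freq) {
        return freq > 0 ? 1.0f : 0.0f;   // count a term once, however often it occurs
    }
    public float coord(int overlap, int maxOverlap) {
        return 1.0f;                     // no bonus for matching more optional clauses
    }
}

// usage, with an assumed index path:
// IndexSearcher searcher = new IndexSearcher("/path/to/index");
// searcher.setSimilarity(new FlatSimilarity());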

I'm using a few subclasses of DisjunctionScorer to provide the actual
score value, e.g. for max and sum.
For each of these scorers, I use a separate Query and Weight.
This gives a parallel class hierarchy for Query, Weight and Scorer.

I guess it's time to have a look at Design Patterns and/or Refactoring
on how to get rid of the parallel class hierarchy. That could also
involve some sort of accrual scorer and Lucene's Similarity.

Regards,
Paul Elschot

 -Ken
 
 On Sat, 13 Nov 2004 12:07:05 +0100, Paul Elschot [EMAIL PROTECTED] 
wrote:
  On Friday 12 November 2004 22:56, Chuck Williams wrote:
  
  
   I had a similar need and wrote MaxDisjunctionQuery and
   MaxDisjunctionScorer.  Unfortunately these are not available as a patch
   but I've included the original message below that has the code (modulo
   line breaks added by simple text email format).
  
   This code is functional -- I use it in my app.  It is optimized for its
   stated use, which involves a small number of clauses.  You'd want to
   improve the incremental sorting (e.g., using the bucket technique of
   BooleanQuery) if you need it for large numbers of clauses.
  
  If you're interested, you can also have a look here for
  yet another DisjunctionScorer:
  http://issues.apache.org/bugzilla/show_bug.cgi?id=31785
  
  It has the advantage that it implements skipTo(), so it can
  be used as a subscorer of ConjunctionScorer, i.e. it can be
  faster in situations like this:
  
  aa AND (bb OR cc)
  
  where bb and cc are treated by the DisjunctionScorer.
  When aa is a filter this can also be used to implement
  a filtering query.
  
  
  
  
   Re. Paul's suggested steps below, I did not integrate this with query
   parser as I didn't need that functionality (since I'm generating the
   multi-field expansions for which max is a much better scoring choice
   than sum).
  
   Chuck
  
   Included message:
  
   -Original Message-
   From: Chuck Williams [mailto:[EMAIL PROTECTED]
   Sent: Monday, October 11, 2004 9:55 PM
   To: [EMAIL PROTECTED]
   Subject: Contribution: better multi-field searching
  
   The files included below (MaxDisjunctionQuery.java and
   MaxDisjunctionScorer.java) provide a new mechanism for searching across
   multiple fields.
  
  The maximum indeed works well, also when the fields differ a lot in length.
  
  Regards,
  Paul
  
  
  
  
  
 
 
 
 
 





RE: fetching similar wordlist as given word

2004-11-24 Thread Chris Hostetter

:can I get a similar word list as output, so that I can show the end
:user a column like "do you mean foam?"
:How can I get a similar word list for the given content?

This is a non-trivial problem, because the definition of "similar" is
subject to interpretation.  I would look into various dictionary
implementations, and see if you can find a good Java-based dictionary that
can suggest alternatives based on an input string.

Once you have that, then you should be able to use IndexSearcher.docFreq
to find out how many docs contain each alternate word, and compare that
with the number of docs that contain the initial word ... if one of the
alternates has a significantly higher number of matches, then you suggest
it.
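A rough sketch of that comparison ("contents" as the field name, the candidate list, and the factor of two are all just placeholders):

import java.io.IOException;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;

// sketch: suggest an alternate word only if it is clearly more common in the index
String suggest(IndexSearcher searcher, String word, String[] candidates) throws IOException {
    int original = searcher.docFreq(new Term("contents", word));
    String best = null;
    int bestFreq = original;
    for (int i = 0; i < candidates.length; i++) {
        int freq = searcher.docFreq(new Term("contents", candidates[i]));
        if (freq > 2 * original && freq > bestFreq) {   // "significantly higher": factor is arbitrary
            best = candidates[i];
            bestFreq = freq;
        }
    }
    return best;   // null means no better suggestion was found
}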


NOTE: The DICT protocol defines a client/server approach to providing
spell correction and definitions.  Maybe you can leverage some of the
spell correction code mentioned in the "Server Software Written in Java"
section of this doc...
http://www.dict.org/links.html
In particular, you might want to take a look at JavaDict's Database.match
function using the LevenshteinStrategy...
http://ktulu.com.ar/javadict/docs/ar/com/ktulu/dict/Database.html#match(java.lang.String,%20ar.com.ktulu.dict.strategies.Strategy)






Re: Help on the Query Parser

2004-11-24 Thread Daniel Naber
On Wednesday 24 November 2004 08:16, Morus Walter wrote:

 Lucene itself doesn't handle wildcards within phrases.

This can be added using PhrasePrefixQuery (which is slightly misnamed):
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/PhrasePrefixQuery.html
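A sketch of how that might look for a query like "java* developer" (field name, prefix and the helper name are assumptions; the prefix expansion via TermEnum is the usual trick, not part of PhrasePrefixQuery itself):

import java.io.IOException;
import java.util.Vector;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.search.PhrasePrefixQuery;
import org.apache.lucene.search.Query;

// sketch: build a phrase where the first position matches any term starting with "prefix"
Query wildcardPhrase(IndexReader reader, String field, String prefix, String nextWord)
        throws IOException {
    Vector expansions = new Vector();
    TermEnum terms = reader.terms(new Term(field, prefix));
    try {
        do {
            Term t = terms.term();
            if (t == null || !field.equals(t.field()) || !t.text().startsWith(prefix)) break;
            expansions.add(t);
        } while (terms.next());
    } finally {
        terms.close();
    }
    PhrasePrefixQuery query = new PhrasePrefixQuery();
    query.add((Term[]) expansions.toArray(new Term[expansions.size()]));  // any "java*" term
    query.add(new Term(field, nextWord));                                 // followed by "developer"
    return query;
}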

Regards
 Daniel




Re: Index in RAM - is it really worthy?

2004-11-24 Thread iouli . golovatyi
Thanks everybody for the responses.

What else can essentially improve query performance?
(I am not speaking now about such things as keeping the index optimized etc. -
that's clear.)

As I experienced on my 2-CPU box, during query execution both
processors were really busy. The question is whether it would get faster if I
got a 4-CPU box, a 10-CPU box...
I mean a real performance boost (at least a factor of 10), not just a few percent.

Would it help if I played with a different query formulation, i.e. "a and (b
or c)" instead of "(b or c) and a"?

Regards,
j.







Kevin A. Burton [EMAIL PROTECTED] wrote on 22.11.2004 21:40 to the Lucene Users List:


Otis Gospodnetic wrote:

For the Lucene book I wrote some test cases that compare FSDirectory
and RAMDirectory.  What I found was that with certain settings
FSDirectory was almost as fast as RAMDirectory.  Personally, I would
push FSDirectory and hope that the OS and the Filesystem do their share
of work and caching for me before looking for ways to optimize my code.
 

Also, another note: doing an index merge in memory is probably
faster if you just use a RAMDirectory and call addIndexes() on it.

This would almost certainly be faster than optimizing on disk, but I
haven't benchmarked it.
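A minimal sketch of that approach (paths and the helper name are made up, and as said above the speed difference is unbenchmarked):

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

// sketch: merge several on-disk indexes inside a RAMDirectory
Directory mergeInRam(String[] indexPaths) throws IOException {
    Directory ram = new RAMDirectory();
    IndexWriter writer = new IndexWriter(ram, new StandardAnalyzer(), true);
    Directory[] dirs = new Directory[indexPaths.length];
    for (int i = 0; i < indexPaths.length; i++) {
        dirs[i] = FSDirectory.getDirectory(indexPaths[i], false);
    }
    writer.addIndexes(dirs);   // merges (and optimizes) the added indexes into "ram"
    writer.close();
    return ram;
}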

Kevin

-- 

Use Rojo (RSS/Atom aggregator).  Visit http://rojo.com. Ask me for an 
invite!  Also see irc.freenode.net #rojo if you want to chat.

Rojo is Hiring! - http://www.rojonetworks.com/JobsAtRojo.html

If you're interested in RSS, Weblogs, Social Networking, etc... then you 
should work for Rojo!  If you recommend someone and we hire them you'll 
get a free iPod!
 
Kevin A. Burton, Location - San Francisco, CA
   AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412







RE: Re: Help on the Query Parser

2004-11-24 Thread Terence Lai
Hi Daniel,

I couldn't figure out how to use the PhrasePrefixQuery with a phrase like "java*
developer". It only provides methods to add terms. Can a term contain a wildcard
character in Lucene?

Thanks,
Terence

 On Wednesday 24 November 2004 08:16, Morus Walter wrote:
 
  Lucene itself doesn't handle wildcards within phrases.
 
 This can be added using PhrasePrefixQuery (which is slightly misnamed):
 http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/PhrasePrefixQuery.html
 
 Regards
  Daniel
 
 







RE: Re: Help on the Query Parser

2004-11-24 Thread Terence Lai
Hi Morus,

I want to search for strings like the ones below:

- "java developer"
- "javascript developer"

Searching for "java*" returns more than I want. That's why I am thinking of
"java* developer".

Terence

 Terence Lai writes:
  
  Looks like the wildcard query disappeared. In fact, I am expecting
  text:"java* developer" to be returned. It seems to me that the QueryParser
  cannot handle the wildcard within a quoted String.
  
 That's not just QueryParser.
 Lucene itself doesn't handle wildcards within phrases.
 You could have a query text:"java* developer" if '*' isn't removed by the
 analyzer. But it would only search for the token 'java*', not any expansion of
 that. I guess this is not what you want.
 
 Morus
 
 







Re: modifying existing index

2004-11-24 Thread Santosh
I am now able to delete from the index using the following:

if (indexDir.exists())
{
    IndexReader reader = IndexReader.open(indexDir);

    uidIter = reader.terms(new Term("id", ""));

    while (uidIter.term() != null && uidIter.term().field() == "id") {
        reader.delete(uidIter.term());
        uidIter.next();
    }

    reader.close();
}

where "id" is the keyword field. But here, too, all the documents are
deleted. How can I modify my code to delete only the particular document with a
given id?





I am creating the index in the following way:

Document doc = new Document();

doc.add(Field.Text("text", text));

doc.add(Field.Keyword("id", Long.toString(id)));

doc.add(Field.Keyword("title", title));

doc.add(Field.Keyword("keywords", keywords));

doc.add(Field.Keyword("type", type));

writer.addDocument(doc);









- Original Message -
From: Chuck Williams [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Wednesday, November 24, 2004 1:06 PM
Subject: RE: modifying existing index


A good way to do this is to add a keyword field with whatever unique id
you have for the document.  Then you can delete the term containing a
unique id to delete the document from the index (look at
IndexReader.delete(Term)).  You can look at the demo class IndexHTML to
see how it does incremental indexing for an example.

Chuck

   -Original Message-
   From: Santosh [mailto:[EMAIL PROTECTED]
   Sent: Tuesday, November 23, 2004 11:34 PM
   To: Lucene Users List
   Subject: Re: modifying existing index

   I have gone through IndexReader and found the method delete(int docNum),
   but where do I get the document number from? Is this predefined, or do we
   have to give a number prior to indexing?

   - Original Message -
   From: Luke Francl [EMAIL PROTECTED]
   To: Lucene Users List [EMAIL PROTECTED]
   Sent: Wednesday, November 24, 2004 1:26 AM
   Subject: Re: modifying existing index

   On Tue, 2004-11-23 at 13:59, Santosh wrote:

    I am using lucene for indexing. When I am creating the index the documents
    are added, but when I want to modify a single existing document and reindex
    it again, it is taken as a new document and added one more time, so I get
    the same document twice in the results.

    To overcome this I am deleting the existing index and recreating the whole
    index again. But is it possible to index the modified document again and
    overwrite the existing document without deleting and recreating? Can I do
    this? If so, how?

   You do not need to recreate the whole index. Just mark the document as
   deleted using the IndexReader and then add it again with the IndexWriter.
   Remember to close your IndexReader and IndexWriter after doing this.

   The deleted document will be removed the next time you optimize your index.

   Luke Francl








RE: modifying existing index

2004-11-24 Thread Chuck Williams
I haven't tried it but believe this should work:

IndexReader reader;

void delete(long id) throws IOException {
    reader.delete(new Term("id", Long.toString(id)));
}

This also has the benefit that it does a binary search rather than a
sequential search.

You will want to pad your ids with leading zeroes if you are going to do
incremental indexing (both when storing them and when looking them up).
Sorting is by lexicographic order, not numerical order, and incremental
indexing is much faster if the ids are kept sorted (as is done in
IndexHTML).
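A small sketch of such padding (the 12-digit width and the helper name are arbitrary assumptions; use whatever covers your id range):

// sketch: zero-pad numeric ids so lexicographic order matches numeric order
static String padId(long id) {
    String s = Long.toString(id);
    StringBuffer buf = new StringBuffer();
    for (int i = s.length(); i < 12; i++) {
        buf.append('0');
    }
    buf.append(s);
    return buf.toString();
}

// store:  doc.add(Field.Keyword("id", padId(id)));
// delete: reader.delete(new Term("id", padId(id)));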

Chuck







Re: URGENT: Help indexing large document set

2004-11-24 Thread John Wang
Thanks Paul!

Using your suggestion, I have changed the update check code to use
only the indexReader:

try {
  localReader = IndexReader.open(path);

  while (keyIter.hasNext()) {
key = (String) keyIter.next();
term = new Term(key, key);
TermDocs tDocs = localReader.termDocs(term);
if (tDocs != null) {
  try {
while (tDocs.next()) {
  localReader.delete(tDocs.doc());
}
  } finally {
tDocs.close();
  }
}
  }
} finally {

  if (localReader != null) {
localReader.close();
  }

}


Unfortunately it didn't seem to make any dramatic difference.

I also see the CPU is only 30-50% busy, so I am guessing it's spending
a lot of time in IO. Any way of making the CPU work harder?

Is a batch size of 500 too small for 1 million documents?

Currently I am seeing a linear speed degradation of 0.3 milliseconds
per document.

Thanks

-John







Re: Too many open files issue

2004-11-24 Thread John Wang
I have also seen this problem.

In the Lucene code, I don't see where the Reader specified when
creating a Field is closed. That holds on to the file.

I am looking at DocumentWriter.invertDocument()

Thanks

-John


On Mon, 22 Nov 2004 16:21:35 -0600, Chris Lamprecht
[EMAIL PROTECTED] wrote:
 A useful resource for increasing the number of file handles on various
 operating systems is the Volano Report:
 
 http://www.volano.com/report/
 
 
 
  I had requested help on an issue we have been facing with the Too many
  open files Exception garbling the search indexes and crashing the
  search on the web site.
 
 





RE: URGENT: Help indexing large document set

2004-11-24 Thread Chuck Williams
Does keyIter return the keys in sorted order?  This should reduce seeks,
especially if the keys are dense.

Also, you should be able to use localReader.delete(term) instead of
iterating over the docs (of which I presume there is only one per key, since
keys are unique).  This won't improve performance, as
IndexReader.delete(Term) does exactly what your code does, but it will
be cleaner.
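A tiny sketch of that cleaner form (the "key" field name and the helper name are assumptions carried over from the earlier snippet):

import java.io.IOException;
import java.util.Iterator;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

// sketch: delete any existing document for each incoming key, without TermDocs
void deleteExisting(IndexReader localReader, Iterator keyIter) throws IOException {
    while (keyIter.hasNext()) {
        String key = (String) keyIter.next();
        localReader.delete(new Term("key", key));   // removes every doc indexed with this key
    }
}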

A linear slowdown with number of docs doesn't make sense, so something
else must be wrong.  I'm not sure what the default buffer size is (it
appears it used to be 128 but is dynamic now I think).  You might find
the slowdown stops after a certain point, especially if you increase
your batch size.

Chuck






Re: Index in RAM - is it really worthy?

2004-11-24 Thread Jonathan Hager
When comparing RAMDirectory and FSDirectory it is important to mention
what OS you are using.  Linux caches recently accessed disk blocks in
memory.  Here is a good article that describes its strategy:
http://forums.gentoo.org/viewtopic.php?t=175419

The 2% difference you are seeing is the memory copy.  With other OSes
you may see a speed-up when using the RAMDirectory, because not all
OSes keep a disk cache in memory, and those must access the disk to read
the index.

Another consideration is that there is currently a 2GB limitation on the
size of a RAMDirectory.  Indexes over 2GB cause an overflow in the
int used to create the buffer.  [see "int len = (int) is.length();" in
RAMDirectory]
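As an illustration of the overflow (just the arithmetic, not Lucene code):

// a file length above Integer.MAX_VALUE wraps to a negative int when cast
long length = 3L * 1024 * 1024 * 1024;   // a 3 GB index file
int len = (int) length;                  // -1073741824, so allocating the buffer fails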

I ended up using a RAMDirectory for a very different reason.  The index
is 1 to 2MB and is rebuilt every few hours.  It takes 3 to 4 minutes
to query the database and rebuild the index, but the search should be
available 100% of the time.  Since the index is so small, I do the
following:

on server startup:
- look for semaphore, if it is there delete the index
- if there is no index, build it to FSdirectory
- load the index from FSDirectory into RAMDirectory

on reindex:
- create semaphore
- rebuild index to FSDirectory
- delete semaphore
- load index from FSDirecttory into RAMDirectory

to search:
- search the RAMDirectory

RAMDirectory could be replaced by a regular FSDirectory, but it seemed
silly to copy the index from disk to disk, when it ultimately needs to
be in memory.

FSDirectory could be replaced by a RAMDirectory, but this means that
it would take the server 3 to 4 minutes longer to start up every time.
By persisting the index, this time would only be necessary if indexing
was interrupted.
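The "load the index from FSDirectory into RAMDirectory" step might look roughly like this (the path handling and helper name are assumptions):

import java.io.IOException;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

// sketch: copy the on-disk index into memory and search the in-memory copy
IndexSearcher loadSearcher(String indexPath) throws IOException {
    FSDirectory disk = FSDirectory.getDirectory(indexPath, false);
    RAMDirectory ram = new RAMDirectory(disk);   // reads the whole index into RAM
    return new IndexSearcher(ram);
}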

Jonathan

On Mon, 22 Nov 2004 12:39:07 -0800, Kevin A. Burton
[EMAIL PROTECTED] wrote:
 Otis Gospodnetic wrote:
 
 For the Lucene book I wrote some test cases that compare FSDirectory
 and RAMDirectory.  What I found was that with certain settings
 FSDirectory was almost as fast as RAMDirectory.  Personally, I would
 push FSDirectory and hope that the OS and the Filesystem do their share
 of work and caching for me before looking for ways to optimize my code.
 
 
 Yes... I performed the same benchmark and in my situation RAMDirectory
 for searches was about 2% slower.
 
  I'm willing to bet that it has to do with the fact that it's a Hashtable
  and not a HashMap (which isn't synchronized).
  
  Also, adding a constructor that takes the number of terms could make loading a
  RAMDirectory faster, since you could prevent rehashing.
  
  If you're on a modern machine your filesystem cache will end up
  buffering your disk anyway, which I'm sure was happening in my situation.
 
 Kevin
 
 





RE: Using multiple analysers within a query

2004-11-24 Thread Kauler, Leto S
Hi again,

Thanks for everyone who replied.  The PerFieldAnalyzerWrapper was a good
suggestion, and one I had overlooked, but for our particular
requirements it wouldn't quite work so I went with overriding
getFieldQuery().

You were right, Paul.  In 1.4.2 a whole heap of QueryParser changes were
made, mostly removing the analyzer parameter from methods.

In the end I built my changes on top of the NewMultiFieldQueryParser
which was shared here recently and works wonders -- thanks Bill Janssen
and sergiu gordea.  I added support for slops and boosts built
together with the multi-fields array, and then overrode getFieldQuery to
check the queryText for a marker character at the start ("=" for example) and,
if found, remove it and switch to a non-tokenising analyser.

Then I found that because that analyser always returns a single token
(TermQuery), it would send spaces through into the final query string,
causing problems.  So, also in getFieldQuery, I check whether it needs breaking
up and converting into a PhraseQuery.

Seems to work, just needs thorough testing.  If anyone would like a copy
I could post it up here.
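For anyone who wants the bare idea without the multi-field part, a stripped-down sketch against the plain QueryParser might look like this (assuming the 1.4.3 getFieldQuery(String field, String queryText) signature; the "=" marker and class name are illustrations, not the actual posted code):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

// sketch: treat field text starting with "=" as a single untokenised term
public class MarkerQueryParser extends QueryParser {

    public MarkerQueryParser(String field, Analyzer analyzer) {
        super(field, analyzer);
    }

    protected Query getFieldQuery(String field, String queryText) throws ParseException {
        if (queryText.startsWith("=")) {
            // matches index-time Field.Keyword values exactly, whitespace and all
            return new TermQuery(new Term(field, queryText.substring(1)));
        }
        return super.getFieldQuery(field, queryText);
    }
}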

Regards, --Leto
(excuse the disclaimer...)



 We have the need for analysed and 'not analysed/not tokenised' clauses

 within one query.  Imagine an unparsed query like:
 
 +title:Hello World +path:Resources\Live\1
 
 In the above example we would want the first clause to use 
 StandardAnalyser and the second to use an analyser which returns the 
 term as a single token.  So a parsed result might look like:
 
 +(title:hello title:world) +path:Resources\Live\1
 
 Would anyone have any suggestions on how this could be done?  I was 
 thinking maybe the QueryParser would have to be changed/extended to 
 accept a separator other than colon :, something like = for 
 example to indicate this clause is not to be tokenised.  Or perhaps 
 this can all be done using a single analyser?





RE: Using multiple analysers within a query

2004-11-24 Thread Kauler, Leto S
Actually, just realised a PhraseQuery is incorrect...
I only want a single TermQuery but it just needs to be quoted, d'oh.


-Original Message-
Then I found that because that analyser always returns a single token
(TermQuery) it would send through spaces into the final query string,
causing problems.  So also in getFieldQuery I check if it needs breaking
up and converting into a PhraseQuery.

