Re: new version of NewMultiFieldQueryParser

2004-10-27 Thread sergiu gordea
Bill Janssen wrote:
I'm not sure this solution is very robust 
I think I already sent an email with a better code...
 Sergiu
Thanks to something Doug said when I first opened this discussion, I
went back and looked at my implementation.  He said, "Can't we just do
this in getFieldQuery?"  Figuring that he probably knew what he was
talking about, I looked a bit harder, and it turns out he was right.
Here's a much simpler version of NewMultiFieldQueryParser that seems
to work.
[For those just tuning in, this is a version of MultiFieldQueryParser
that will work with a default query operator of AND, as well as with
OR.]
Enjoy!
Bill
class NewMultiFieldQueryParser extends QueryParser {
    static private final String DEFAULT_FIELD = "%%";

    protected String[] fieldnames = null;
    private Analyzer analyzer = null;

    public NewMultiFieldQueryParser (Analyzer a) {
        super(DEFAULT_FIELD, a);
    }

    public NewMultiFieldQueryParser (String[] f, Analyzer a) {
        super(DEFAULT_FIELD, a);
        fieldnames = f;
        analyzer = a;
    }

    public void setFieldNames (String[] f) {
        fieldnames = f;
    }

    protected Query getFieldQuery (String field, Analyzer a, String queryText)
            throws ParseException {
        Query x = super.getFieldQuery(field, a, queryText);
        if (field == DEFAULT_FIELD && (fieldnames != null)) {
            BooleanQuery q2 = new BooleanQuery();
            if (x instanceof PhraseQuery) {
                Term[] terms = ((PhraseQuery)x).getTerms();
                for (int i = 0;  i < fieldnames.length;  i++) {
                    PhraseQuery q3 = new PhraseQuery();
                    q3.setSlop(((PhraseQuery)x).getSlop());
                    for (int j = 0;  j < terms.length;  j++) {
                        q3.add(new Term(fieldnames[i], terms[j].text()));
                    }
                    q2.add(q3, false, false);
                }
            } else if (x instanceof TermQuery) {
                String text = ((TermQuery)x).getTerm().text();
                for (int i = 0;  i < fieldnames.length;  i++) {
                    q2.add(new TermQuery(new Term(fieldnames[i], text)), false, false);
                }
            }
            return q2;
        }
        return x;
    }
}
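[A minimal usage sketch, not from the original mail: the field names, analyzer
and index path are made up, and setOperator() is the Lucene 1.4-era call for
switching the default operator to AND.]

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

// Parse one query string against several fields with AND semantics.
String[] fields = { "title", "body" };   // hypothetical field names
NewMultiFieldQueryParser parser =
    new NewMultiFieldQueryParser(fields, new StandardAnalyzer());
parser.setOperator(QueryParser.DEFAULT_OPERATOR_AND);
Query query = parser.parse("apache lucene");
Hits hits = new IndexSearcher("/path/to/index").search(query);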


Backup strategies

2004-10-27 Thread Christoph Kiehl
Hi,
I'm curious about your strategy to backup indexes based on FSDirectory. 
If I do a file based copy I suspect I will get corrupted data because of 
concurrent write access.
My current favorite is to create an empty index and use 
IndexWriter.addIndexes() to copy the current index state. But I'm not 
sure about the performance of this solution.

How do you make your backups?
Regards,
Christoph
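[A bare sketch of the addIndexes() approach described above, for reference;
the paths and analyzer are placeholders, and the performance question is
still open.]

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Merge the current index state into a fresh, empty backup index.
Directory live = FSDirectory.getDirectory("/path/to/index", false);
Directory backupDir = FSDirectory.getDirectory("/path/to/backup", true);
IndexWriter backup = new IndexWriter(backupDir, new StandardAnalyzer(), true);
backup.addIndexes(new Directory[] { live });
backup.close();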


Re: Backup strategies

2004-10-27 Thread Christiaan Fluit
Christoph Kiehl wrote:
I'm curious about your strategy to backup indexes based on FSDirectory. 
If I do a file based copy I suspect I will get corrupted data because of 
concurrent write access.
My current favorite is to create an empty index and use 
IndexWriter.addIndexes() to copy the current index state. But I'm not 
sure about the performance of this solution.
I have no practical experience with backing up an online index, but I 
would try to find out the details of the write lock mechanism used by 
Lucene at the file level. You can then create a backup component that 
write-locks the index and does a regular file copy of the index dir. 
During backup time searches can continue while updates will be 
temporarily blocked.

But as I said, I'm only speculating...
Chris


Re: Backup strategies

2004-10-27 Thread Christoph Kiehl
Christiaan Fluit wrote:
I have no practical experience with backing up an online index, but I 
would try to find out the details of the write lock mechanism used by 
Lucene at the file level. You can then create a backup component that 
write-locks the index and does a regular file copy of the index dir. 
During backup time searches can continue while updates will be 
temporarily blocked.
The problem with this approach is that it will not only block write 
operations; those operations will also time out, which leads to 
exceptions. To prevent this you must implement some queuing, which is 
what I would like to avoid.

Regards,
Christoph


Boost value

2004-10-27 Thread Michael Hartmann
Hello,

I am working on Lucene and tried to understand the calculation of the score
value. As far as I understand it works as follows:

(1) idf = ln(numDocs/(docFreq+1))

(2) queryWeight = idf * boost

(3) sumOfSquaredWeights = queryWeight * queryWeight

(4) norm = 1/sqrt(sumOfSquaredWeights)


??? Question 1: why not

norm = 1/queryWeight


(5) queryWeight' = queryWeight * norm

(6) weightValue = queryWeight' * idf


??? Question 2: using (6) and inserting (1) - (5) step by step

=> weightValue = idf


I did only purely algebraic substitutions, and it all comes down to a simple
formula in which the boost value no longer appears. Where is my mistake?
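Written out, the substitution chain (using nothing beyond (1) - (6) above) is:

norm         = 1/sqrt(queryWeight^2) = 1/queryWeight     from (3), (4)
queryWeight' = queryWeight * norm    = 1                 from (5)
weightValue  = queryWeight' * idf    = idf                from (6)

so the boost factor cancels out of the final weight.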

Thanks,
Michael







RE: Indexing process causes Tomcat to stop working

2004-10-27 Thread Aad Nales
James,

How do you kick off your reindex? Could it be a session timeout? 

cheers,
Aad


Hello,

I am a Java/Lucene/Tomcat newbie (I know that does not bode well as a start
to a post), but I really am in dire straits as far as Lucene goes, so bear
with me. I am working on indexing and replacing the search functionality for
a website (about 10 gig in size, although only about 7 gig is indexed). I
presently have a working model based on the luceneweb demo dispatched with
Lucene; this has already proven functional when tested on various sites
(admittedly much smaller, 200-400mb etc). However, issues occur when
performing the index on the main site that I haven't found explained on any
of the Lucene forums thus far.

After a successful index and optimisation of the website (takes around 4hrs
40m unoptimised) I can't get to the index.jsp or even access tomcat. My
first thought was to restart tomcat. No joy and no access. Thinking the
larger index had killed the test server, I accessed apache on port 80, which
worked perfectly. After a few checks I realised the test server was fine,
apache was fine, and I used the same application to create an index of the
tomcat docs, so java was working. Confused, I went back to the forums, FAQs
and groups to see if anyone had any similar problems, and have come up with
a brief list of what my problem is not:

There are no index write.lock files found for Lucene in either the /tmp or
opt/tomcat/temp directories, so the index is open to be searched. Nor does
'top' reveal anything overloading the system. Apache is running fine and
displays all relevant pages. Tomcat cannot be reached with a browser
(neither the default congratulations page nor the Luceneweb application).
Tomcat was a fresh install, as was Java, and Tomcat logs show nothing
different to the standard startup logs. So I logged the entire indexing
process and saw two errors occurring infrequently.

Parse Aborted: Encountered \ at line 6, column 129. //where these
values 
vary
Was expecting one of:
   ArgName ...
   = ...
   TagEnd ...

I'm satisfied this is just the HTML parser kicking off about some badly
formatted HTML and is only affecting what is indexed, but it's here for
completeness. The other error is more serious:

java.io.IOException: Pipe closed
   at java.io.PipedInputStream.receive(PipedInputStream.java:136)
   at java.io.PipedInputStream.receive(PipedInputStream.java:176)
   at java.io.PipedOutputStream.write(PipedOutputStream.java:129)
   at 
sun.nio.cs.StreamEncoder$CharsetSE.writeBytes(StreamEncoder.java:336)
   at 
sun.nio.cs.StreamEncoder$CharsetSE.implWrite(StreamEncoder.java:395)
   at sun.nio.cs.StreamEncoder.write(StreamEncoder.java:136)
   at sun.nio.cs.StreamEncoder.write(StreamEncoder.java:146)
   at java.io.OutputStreamWriter.write(OutputStreamWriter.java:204)
   at java.io.Writer.write(Writer.java:126)
   at 
org.apache.lucene.demo.html.HTMLParser.addText(HTMLParser.java:137)
   at 
org.apache.lucene.demo.html.HTMLParser.HTMLDocument(HTMLParser.java:203)
   at
org.apache.lucene.demo.html.ParserThread.run(ParserThread.java:31)

I'm again pretty sure that this is the same error that occurred once before
when I was using maxFieldLength to limit the number of terms recorded.
I'm also confident it's a threading error, and I found the following post by
Doug Cutting that seemed to explain it:
http://java2.5341.com/msg/80502.html
However, I am assuming that's what it is and haven't yet attempted to change
the threading system of the demo, due to my lack of java knowledge.

The strange thing is that after restarting the server, all aspects of the
Lucene web application work perfectly: stemming, alphanumeric indexing,
summaries etc. are all as expected. So I am left assuming, due to this (and
by running out of options), that Lucene has somehow done something to Tomcat
by doing such a large index. Being that both run off Java, I guess it's
something to do with that, but I have nowhere near enough experience in java
to work out what.

The system I am currently running on is Java 1.4.2_05, Tomcat 5.0.27,
Lucene 1.4.1, Linux version 2.4.20-8 (gcc version 3.2.2 20030222 (Red
Hat Linux 3.2.2-5)), Apache 2.0.42. I have not modified the mergeFactor or
MaxMergeDocuments, nor am I using RAMDirectories. The processor is 800MHz
and there is 128mb of RAM.

If more info is required on setup, source code etc or you think this
should 
be moved to a tomcat forum just post.

Best regards and thanks in advance for any advice you can offer,

J Tyrrell






RE: Indexing process causes Tomcat to stop working

2004-10-27 Thread James Tyrrell
Aad,
 D'oh forgot to mention that mildly important info. Rather than 
re-index I am just creating a new index each time, this makes things easier 
to roll-back etc (which is what my boss wants). the command line is 
something like java com.lucene.IndexHTML -create -index indexstore/ .. I 
have wondered about whether sessions could be a problem, but I don't think 
so, otherwise wouldn't a restart of Tomcat be sufficient rather than a 
reboot? I even tried the killall command on java & tomcat then started 
everything again to no avail.

cheers,
JT



RE: Indexing process causes Tomcat to stop working

2004-10-27 Thread Armbrust, Daniel C.
So, are you creating the indexes from inside the tomcat runtime, or are you creating 
them on the command line (which would be in a different runtime than tomcat)?

What happens to tomcat?  Does it hang - still running but not responsive?  Or does it 
crash?  

If it hangs, maybe you are running out of memory.  By default, Tomcat's limit is set 
pretty low...

There is no reason at all you should have to reboot... If you stop and start tomcat, 
(make sure it actually stopped - sometimes it requires a kill -9 when it really gets 
hung) it should start working again.  Depending on your setup of Tomcat + apache, you 
may  have to restart apache as well to get them linked to each other again...

Dan




-Original Message-
From: James Tyrrell [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, October 27, 2004 10:49 AM
To: [EMAIL PROTECTED]
Subject: RE: Indexing process causes Tomcat to stop working

Aad,
  D'oh forgot to mention that mildly important info. Rather than 
re-index I am just creating a new index each time, this makes things easier 
to roll-back etc (which is what my boss wants). the command line is 
something like java com.lucene.IndexHTML -create -index indexstore/ .. I 
have wondered about whether sessions could be a problem, but I don't think 
so, otherwise wouldn't a restart of Tomcat be sufficient rather than a 
reboot? I even tried the killall command on java & tomcat then started 
everything again to no avail.

cheers,

JT






IndexWriter Constructor question

2004-10-27 Thread Armbrust, Daniel C.
Wouldn't it make more sense if the constructor for the IndexWriter always created an 
index if it doesn't exist - and the boolean parameter were "clear" (instead of 
"create")?

So instead of this (from javadoc):

IndexWriter

public IndexWriter(Directory d,
   Analyzer a,
   boolean create)
throws IOException

Constructs an IndexWriter for the index in d. Text will be analyzed with a. If 
create is true, then a new, empty index will be created in d, replacing the index 
already there, if any.

Parameters:
d - the index directory
a - the analyzer to use
create - true to create the index or overwrite the existing one; false to append 
to the existing index 
Throws:
IOException - if the directory cannot be read/written to, or if it does not exist, 
and create is false


We would have this:

IndexWriter

public IndexWriter(Directory d,
   Analyzer a,
   boolean clear)
throws IOException

Constructs an IndexWriter for the index in d. Text will be analyzed with a. If 
clear is true, and an index exists at location d, then it will be erased, and a new, 
empty index will be created in d.

Parameters:
d - the index directory
a - the analyzer to use
clear - true to overwrite the existing one; false to append to the existing index 
Throws:
IOException - if the directory cannot be read/written to, or if it does not exist.



Its current behavior is kind of annoying, because I have an app that should never 
clear an existing index; it should always append.  So I want create set to false.  But 
when I am starting a brand new index, I have to change the create flag to keep it from 
throwing an exception...  I guess for now I will have to write code to check if an 
index actually has content yet, and if it doesn't, change the flag on the fly.




Re: IndexWriter Constructor question

2004-10-27 Thread Justin Swanhart
You could always modify your own local copy if you want to change the
behavior of the parameter.

or just do:
IndexWriter w = new IndexWriter(indexDirectory,
                                new StandardAnalyzer(),
                                !(IndexReader.indexExists(indexDirectory)));

If you do that, then if an index exists then it will not be created,
otherwise it will be...

On Wed, 27 Oct 2004 12:26:29 -0500, Armbrust, Daniel C.
[EMAIL PROTECTED] wrote:
 Wouldn't it make more sense if the constructor for the IndexWriter always created an 
 index if it doesn't exist - and the boolean parameter should be clear (instead of 
 create)
 
 So instead of this (from javadoc):
 
 IndexWriter
 
 public IndexWriter(Directory d,
Analyzer a,
boolean create)
 throws IOException
 
 Constructs an IndexWriter for the index in d. Text will be analyzed with a. If 
 create is true, then a new, empty index will be created in d, replacing the index 
 already there, if any.
 
 Parameters:
 d - the index directory
 a - the analyzer to use
 create - true to create the index or overwrite the existing one; false to append 
 to the existing index
 Throws:
 IOException - if the directory cannot be read/written to, or if it does not 
 exist, and create is false
 
 We would have this:
 
 IndexWriter
 
 public IndexWriter(Directory d,
Analyzer a,
boolean clear)
 throws IOException
 
 Constructs an IndexWriter for the index in d. Text will be analyzed with a. If 
 clear is true, and a index exists at location d, then it will be erased, and a new, 
 empty index will be created in d.
 
 Parameters:
 d - the index directory
 a - the analyzer to use
 clear - true to overwrite the existing one; false to append to the existing index
 Throws:
 IOException - if the directory cannot be read/written to, or if it does not 
 exist.
 
 Its current behavior is kind of annoying, because I have an app that should never 
 clear an existing index, it should always append.  So I want create set to false.  
 But when I am starting a brand new index, then I have to change the create flag to 
 keep it from throwing an exception...  I guess for now I will have to write code to 
 check if a index actually has content yet, and if it doesn't, change the flag on the 
 fly.
 



Poor Lucene Ranking for Short Text

2004-10-27 Thread Kevin A. Burton
http://www.peerfear.org/rss/permalink/2004/10/26/PoorLuceneRankingForShortText/
--
Use Rojo (RSS/Atom aggregator).  Visit http://rojo.com. Ask me for an 
invite!  Also see irc.freenode.net #rojo if you want to chat.

Rojo is Hiring! - http://www.rojonetworks.com/JobsAtRojo.html
If you're interested in RSS, Weblogs, Social Networking, etc... then you 
should work for Rojo!  If you recommend someone and we hire them you'll 
get a free iPod!
   
Kevin A. Burton, Location - San Francisco, CA
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412



Re: Poor Lucene Ranking for Short Text

2004-10-27 Thread Daniel Naber
On Wednesday 27 October 2004 20:20, Kevin A. Burton wrote:

 http://www.peerfear.org/rss/permalink/2004/10/26/PoorLuceneRankingForSho
rtText/

(Kevin complains about shorter documents ranked higher)

This is something that can easily be fixed. Just use a Similarity 
implementation that extends DefaultSimilarity and that overwrites 
lengthNorm: just return 1.0f there. You need to use that Similarity for 
indexing and searching, i.e. it requires reindexing.
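[A minimal sketch of such a Similarity; the class name is made up, and
Similarity.setDefault() is one way to install it for both indexing and
searching.]

import org.apache.lucene.search.DefaultSimilarity;
import org.apache.lucene.search.Similarity;

// Ignore field length completely: every field gets the same norm.
public class NoLengthNormSimilarity extends DefaultSimilarity {
    public float lengthNorm(String fieldName, int numTokens) {
        return 1.0f;
    }
}

// Install it before indexing and before searching, then rebuild the index:
// Similarity.setDefault(new NoLengthNormSimilarity());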

Regards
 Daniel

-- 
http://www.danielnaber.de




Stopwords in Exact phrase

2004-10-27 Thread Ravi
Is there a way to include stopwords in an exact phrase search? For
example, when I search on "Melbourne IT", Lucene only searches for
"Melbourne", ignoring "IT".

Thanks,
Ravi. 





Re: Stopwords in Exact phrase

2004-10-27 Thread Erik Hatcher
On Oct 27, 2004, at 3:36 PM, Ravi wrote:
 Is there way to include stopwords in an exact phrase search? For
example, when I search on Melbourne IT, Lucene only searches for
Melbourne ignoring IT.
But you want stop words removed for general term queries?
Have a look at how Nutch does its thing - it has a very similar type of 
situation where it deals with common terms differently if they are in a 
phrase.

There are other choices - use a different analyzer, and if you want 
that used only for phrase queries you can override QueryParser and its 
getFieldQuery method.

Erik
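[A rough sketch of the getFieldQuery override Erik mentions; the class name
and the WhitespaceAnalyzer choice are assumptions, and it only helps if the
stop words were actually kept in the index.]

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class PhraseAwareQueryParser extends QueryParser {
    // Analyzer without stop-word removal, used only for phrase text.
    private final Analyzer phraseAnalyzer = new WhitespaceAnalyzer();

    public PhraseAwareQueryParser(String field, Analyzer a) {
        super(field, a);
    }

    protected Query getFieldQuery(String field, Analyzer analyzer, String queryText)
            throws ParseException {
        // Multi-word text comes from a quoted phrase; keep its stop words.
        if (queryText.indexOf(' ') != -1) {
            return super.getFieldQuery(field, phraseAnalyzer, queryText);
        }
        return super.getFieldQuery(field, analyzer, queryText);
    }
}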


Re: Stopwords in Exact phrase

2004-10-27 Thread Justin Swanhart
Your analyzer will have removed the stopword when you indexed your documents, so
Lucene won't be able to do this for you.

You will need to implement a second pass over the results returned by Lucene and
check to see if the stopword is included, perhaps with String.indexOf().


On Wed, 27 Oct 2004 14:36:14 -0500, Ravi [EMAIL PROTECTED] wrote:
  Is there way to include stopwords in an exact phrase search? For
 example, when I search on Melbourne IT, Lucene only searches for
 Melbourne ignoring IT.
 
 Thanks,
 Ravi.
 



Highlighter problem: null as result

2004-10-27 Thread Miro Max
Hello,

I'm trying to use the highlighter from the sandbox, and I've got a
problem with some of the results I get back from it.

Normally, when I search my index for e.g. "motor", I get circa 150
results, and these results are OK. But when I use the highlighter, I get
some results as null values from the field "content".

Is this a bug in the highlighter class?

greetings

jose









Re: Poor Lucene Ranking for Short Text

2004-10-27 Thread Kevin A. Burton
Daniel Naber wrote:
(Kevin complains about shorter documents ranked higher)
This is something that can easily be fixed. Just use a Similarity 
implementation that extends DefaultSimilarity and that overwrites 
lengthNorm: just return 1.0f there. You need to use that Similarity for 
indexing and searching, i.e. it requires reindexing.
 

What happens when I do this with an existing index? I don't want to have 
to rewrite this index as it will take FOREVER

If the current behavior is all that happens this is fine... this way I 
can just get this behavior for new documents that are added.

Also... why isn't this the default?
Kevin



Re: new version of NewMultiFieldQueryParser

2004-10-27 Thread Bill Janssen
 I'm not sure this solution is very robust 

Thanks, but I'm pretty sure it *is* robust.  Can you please offer a
specific critique?  Always happy to learn and improve :-).

 I think I already sent an email with a better code...

Pretty vague.  Can you send a URL for that message in the archive?

Bill




Re: Looking for consulting help on project

2004-10-27 Thread David Spencer
Suggestions
[a]
Try invoking the VM w/ an option like -XX:CompileThreshold=100 or even 
a smaller number. This encourages the hotspot VM to compile methods 
sooner, thus the app will take less time to warm up.

http://java.sun.com/docs/hotspot/VMOptions.html#additional
You might want to search the web for refs to this, esp how things like 
Eclipse are brought up, as I think their invocation script sets other 
obscure options to guide GC too.

[b]
Any time I've worked w/ a hard core java server I've always found it 
helpful to have a loop explicitly trying to force gc - this is the idiom 
I use (i.e. you may have to do more than just System.gc()), and my 
suggestion is to try calling this every 15-60 secs so that memory use 
never jumps. I know that in theory you should never need to, but it may 
help.

public static long gc()
{
    long bef = mem();           // mem(): helper (not shown) returning current used heap
    System.gc();
    sleep( 100);                // sleep(): helper (not shown) wrapping Thread.sleep()
    System.runFinalization();
    sleep( 100);
    System.gc();
    long aft = mem();
    return aft - bef;           // change in used heap across the forced collection
}
Gordon Riggs wrote:
Hi,
 
I am working on a web development project using PHP and mySQL. The team has
implemented full text search with mySQL, but is now researching Lucene to
help with performance/scalability issues. The team is looking for a
developer who has experience working with Lucene and can assist with
integrating into our environment. What follows is a brief overview of the
problems that we're working to address. If you have the experience with
using Lucene with large amounts of data (we have roughly 16 million records)
where search time is critical (needs to be under .2 seconds), then please
respond.
 
Thanks,
Gordon Riggs
[EMAIL PROTECTED]
 
1. Loading index into memory using Lucene's RAMDirectory
Why is the Java heap 2.9GB for a 1.4GB index?
Why can we not load an index over 1.4GB in size?  We receive
'java.lang.OutOfMemoryError' even with the -mx flag set to as high as '10g'.
We're using a dedicated test machine which has dual AMD Opteron processors
and 12GB of memory.  The OS is SuSE Linux Enterprise Server 9 (x86_64).  The
java version is: Java(TM) 2 Runtime Environment, Standard Edition (build
Blackdown-1.4.2) Java HotSpot(TM) 64-Bit Server VM (build
Blackdown-1.4.2-fcs, mixed mode)
We also get similar results with: Java(TM) 2 Runtime Environment, Standard
Edition (build 1.4.2_03-b02) Java HotSpot(TM) Client VM (build 1.4.2_03-b02,
mixed mode)

2. How to keep Lucene and Java in memory, to improve performance
The idea is to have a Lucene daemon that loads the index into memory once
on startup. It then listens for connections and performs search requests for
clients using that single index instance.
Do you foresee any problems (other than the ones stated above) with this
approach?
Garbage collection and/or memory leaks?  Performance issues?  
Concurrency issues with multiple searches coming in at once?
What's involved in writing the daemon?
Assuming that we need the daemon, we need to find out how big a job it is to
develop, what requirements need to be specified, etc.

3. How to interface our PHP web application with Java
Our web application is written in PHP so we need a communication interface
for performing search queries that is both PHP and Java friendly.
What do you think would be a good solution?  XML-RPC?
What's involved in developing the solution?
4. How to tune Lucene
Are there ways to tune Lucene in order to improve performance? We already
plan on moving the index into memory.
What else can be done to improve the search times? Can the way the index is
built affect performance?
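[Relating to points 1 and 2 of the quoted message, the core of a
load-once-into-RAM daemon is small; the path and field name below are
placeholders and all connection handling is omitted.]

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

// At daemon startup: pull the on-disk index into the heap once...
RAMDirectory ram = new RAMDirectory(FSDirectory.getDirectory("/path/to/index", false));
// ...and share a single IndexSearcher across all requests
// (searching through one IndexSearcher from multiple threads is fine).
IndexSearcher searcher = new IndexSearcher(ram);

// Per request: parse the client's query string and search.
Query query = QueryParser.parse("client query here", "contents", new StandardAnalyzer());
Hits hits = searcher.search(query);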


weights on multi index searches

2004-10-27 Thread Ravi
Can I assign weights to the different indexes when I search against multiple
indexes, so that the final score of a document is a linear combination of
the weight of each index and the individual score from that index? Is
this possible in Lucene?
 
 
Thanks
Ravi. 


Locks and Readers and Writers

2004-10-27 Thread yahootintin . 1247688
Hi,



I'm getting:

java.io.IOException: Lock obtain timed out

I have a writer service that opens the index to delete and add docs.  I have
a reader service that opens the index for searching only.

This error occurs when the reader service opens the index (this takes about
500ms).  Meanwhile the writer service tries to open it a couple milliseconds
later.  The reader service hasn't fully opened the index yet and this
exception gets thrown.

What are my options?  Should I just set the timeout to a higher value?

Thanks.
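[One knob to try, sketched below: in the Lucene 1.4.x source the lock
timeouts are public static fields on IndexWriter, in milliseconds, with
fairly short defaults; verify the exact field names against your version
before relying on this.]

import org.apache.lucene.index.IndexWriter;

// Allow more time to obtain the commit/write locks before
// "Lock obtain timed out" is thrown; set these once at startup.
IndexWriter.COMMIT_LOCK_TIMEOUT = 60 * 1000;
IndexWriter.WRITE_LOCK_TIMEOUT  = 60 * 1000;

Raising the timeouts only widens the window, though; queuing or otherwise
serializing writer access in the application is the more thorough fix.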




Re: Poor Lucene Ranking for Short Text

2004-10-27 Thread Daniel Naber
On Wednesday 27 October 2004 22:47, Kevin A. Burton wrote:

 If the current behavior is all that happens this is fine... this way I
 can just get this behavior for new documents that are added.

You'll have to try it out, I'm not sure what exactly will happen.

 Also... why isn't this the default?

You'll probably end up with many documents having exactly the same ranking. 
And those documents will then be sorted in a random order (not really, 
they will be sorted by internal ID I think, but that's not a useful order for 
most use cases).

Regards
 Daniel

-- 
http://www.danielnaber.de




document ID and performance

2004-10-27 Thread Yan Pujante
Hello
I wrote the following test programs:
I index 150,000 documents in Lucene and I build each document using 
this method.

private Document buildDocument(String documentID, String body)
{
    Document document = new Document();
    document.add(Field.Keyword("docID", documentID));
    document.add(Field.UnStored("body", body));
    return document;
}
I then run a search using the following method:
int search(String word) throws IOException
{
    IndexSearcher searcher = new IndexSearcher(_indexDirectory);
    try
    {
        Query q = new TermQuery(new Term("body", word));
        Hits hits = searcher.search(q);
        return hits.length();
    }
    finally
    {
        searcher.close();
    }
}
when I run this method on the word 'software' I get about 20,000 
results and it takes an average of 22ms per search which is very good.

If I run the following method:
List search2(String word) throws IOException
{
    IndexSearcher searcher = new IndexSearcher(_indexDirectory);
    try
    {
        Query q = new TermQuery(new Term("body", word));
        Hits hits = searcher.search(q);
        ArrayList res = new ArrayList(hits.length());
        for(int i = 0; i < hits.length(); i++)
        {
            res.add(hits.doc(i).get("docID"));
        }
        return res;
    }
    finally
    {
        searcher.close();
    }
}
I get of course the same number of results, but the performance really 
drops: I get a time which varies from 300ms to 700ms per query, and it is 
not consistent... it varies a lot from one run to the other.

If I run this other method:
List search2(String word) throws IOException
{
    IndexSearcher searcher = new IndexSearcher(_indexDirectory);
    try
    {
        Query q = new TermQuery(new Term("body", word));
        MyHitCollector collector = new MyHitCollector();
        searcher.search(q, collector);
        return collector.getDocumentIDs();
    }
    finally
    {
        searcher.close();
    }
}
with
public class MyHitCollector extends HitCollector
{
    ArrayList res = new ArrayList();

    public void collect(int i, float v)
    {
        res.add(String.valueOf(i));
    }

    public List getDocumentIDs()
    {
        return res;
    }
}
I get the same kind of results I was getting the first time: about 22ms 
to run the query.

This clearly shows that the action of searching the documents is 
extremely fast, and that it is the act of actually accessing the 
documents (hits.doc(i)) which makes the performance drop.

I know that there is no relationship between the document id returned 
in the collect method and the document id I store myself in the docID 
field, but technically that is the only thing I care about:

I want to run a very fast search that simply returns the matching 
document id. Is there any way to associate the document id returned in 
the hit collector to the internal document ID stored in the index ? 
Anybody has any idea how to do that ? Ideally you would want to be able 
to write something like this:

document.add(Field.ID(documentID));
and then in the HitCollector API:
collect(String documentID, float score) with the documentID being the 
one you stored (but which would be returned very efficiently)

Thanks for your help
Yan Pujante
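[Not from the original mail, but one common way to get this: since
Field.Keyword indexes the docID values, you can walk the term index once per
IndexReader and build an array that maps Lucene's internal document numbers
to your own IDs, then have the HitCollector record ids[doc] instead of the
raw number. A sketch, with error handling kept minimal:]

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;

// One pass over the "docID" terms fills ids[internalDocNumber] = externalId.
static String[] loadDocIds(IndexReader reader) throws IOException {
    String[] ids = new String[reader.maxDoc()];
    TermEnum terms = reader.terms(new Term("docID", ""));
    TermDocs termDocs = reader.termDocs();
    try {
        do {
            Term term = terms.term();
            if (term == null || !"docID".equals(term.field())) break;
            termDocs.seek(terms);
            while (termDocs.next()) {
                ids[termDocs.doc()] = term.text();
            }
        } while (terms.next());
    } finally {
        termDocs.close();
        terms.close();
    }
    return ids;
}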


Documents with 1 word are given unfair lengthNorm()

2004-10-27 Thread Kevin A. Burton
WRT my blog post:
It seems the problem is that the distribution for lengthNorm() starts at 
1 and moves down from there.  Returning 1.0f would work, but then HUGE 
documents would get the same norm as everything else and so would 
distort the results.
What would you think of using this implementation for lengthNorm:
public float lengthNorm( String fieldName, int numTokens ) {

    int THRESHOLD = 50;

    int nt = numTokens;

    if ( numTokens <= THRESHOLD )
        ++nt;

    if ( numTokens > THRESHOLD )
        nt -= THRESHOLD;

    float v = (float)(1.0 / Math.sqrt(nt));

    if ( numTokens <= THRESHOLD )
        v = 1 - v;

    return v;
}
This starts the distribution low... approaches 1.0 when 50 terms are in 
the document... then asymptotically moves to zero from here on out based 
on sqrt.

For example, values from 1 - 150 would yield (I'd graph this out 
but I'm too lazy):

1 - 0.29289323
2 - 0.42264974
3 - 0.5
4 - 0.5527864
5 - 0.5917517
6 - 0.6220355
7 - 0.6464466
8 - 0.6666667
9 - 0.6837722
10 - 0.69848865
11 - 0.7113249
12 - 0.72264993
13 - 0.73273873
14 - 0.74180114
15 - 0.75
16 - 0.7574644
17 - 0.7642977
18 - 0.7705843
19 - 0.7763932
20 - 0.7817821
21 - 0.7867993
22 - 0.7914856
23 - 0.79587585
24 - 0.8
25 - 0.80388385
26 - 0.8075499
27 - 0.81101775
28 - 0.81430465
29 - 0.81742585
30 - 0.8203947
31 - 0.8232233
32 - 0.82592237
33 - 0.8285014
34 - 0.83096915
35 - 0.8333333
36 - 0.83560103
37 - 0.83777857
38 - 0.8398719
39 - 0.8418861
40 - 0.84382623
41 - 0.8456966
42 - 0.8475014
43 - 0.84924436
44 - 0.8509288
45 - 0.852558
46 - 0.85413504
47 - 0.85566247
48 - 0.85714287
49 - 0.8585786
50 - 0.859972
51 - 1.0
52 - 0.70710677
53 - 0.57735026
54 - 0.5
55 - 0.4472136
56 - 0.4082483
57 - 0.37796447
58 - 0.35355338
59 - 0.33333334
60 - 0.31622776
61 - 0.30151135
62 - 0.28867513
63 - 0.2773501
64 - 0.26726124
65 - 0.2581989
66 - 0.25
67 - 0.24253562
68 - 0.23570226
69 - 0.22941573
70 - 0.2236068
71 - 0.2182179
72 - 0.21320072
73 - 0.2085144
74 - 0.20412415
75 - 0.2
76 - 0.19611613
77 - 0.19245009
78 - 0.18898223
79 - 0.18569534
80 - 0.18257418
81 - 0.1796053
82 - 0.17677669
83 - 0.17407766
84 - 0.17149858
85 - 0.16903085
86 - 0.16666667
87 - 0.16439898
88 - 0.16222142
89 - 0.16012815
90 - 0.15811388
91 - 0.15617377
92 - 0.15430336
93 - 0.15249857
94 - 0.15075567
95 - 0.1490712
96 - 0.14744195
97 - 0.145865
98 - 0.14433756
99 - 0.14285715
100 - 0.14142136
101 - 0.14002801
102 - 0.13867505
103 - 0.13736056
104 - 0.13608277
105 - 0.13483997
106 - 0.13363062
107 - 0.13245323
108 - 0.13130644
109 - 0.13018891
110 - 0.12909944
111 - 0.12803689
112 - 0.12700012
113 - 0.12598816
114 - 0.125
115 - 0.12403473
116 - 0.12309149
117 - 0.12216944
118 - 0.12126781
119 - 0.120385855
120 - 0.11952286
121 - 0.11867817
122 - 0.11785113
123 - 0.11704115
124 - 0.11624764
125 - 0.11547005
126 - 0.114707865
127 - 0.11396058
128 - 0.1132277
129 - 0.11250879
130 - 0.1118034
131 - 0.11111111
132 - 0.11043153
133 - 0.10976426
134 - 0.10910895
135 - 0.10846523
136 - 0.107832775
137 - 0.107211255
138 - 0.10660036
139 - 0.10599979
140 - 0.10540926
141 - 0.104828484
142 - 0.1042572
143 - 0.10369517
144 - 0.10314213
145 - 0.10259783
146 - 0.10206208
147 - 0.10153462
148 - 0.101015255
149 - 0.10050378

