Re: new version of NewMultiFieldQueryParser
Bill Janssen wrote:

> I'm not sure this solution is very robust
> I think I already sent an email with a better code... Sergiu

Thanks to something Doug said when I first opened this discussion, I went back and looked at my implementation. He said, "Can't we just do this in getFieldQuery?". Figuring that he probably knew what he was talking about, I looked a bit harder, and it turns out he was right. Here's a much simpler version of NewMultiFieldQueryParser that seems to work. [For those just tuning in, this is a version of MultiFieldQueryParser that will work with a default query operator of AND, as well as with OR.]

Enjoy!

Bill

class NewMultiFieldQueryParser extends QueryParser {

    static private final String DEFAULT_FIELD = "";

    protected String[] fieldnames = null;
    private Analyzer analyzer = null;

    public NewMultiFieldQueryParser (Analyzer a) {
        super(DEFAULT_FIELD, a);
    }

    public NewMultiFieldQueryParser (String[] f, Analyzer a) {
        super(DEFAULT_FIELD, a);
        fieldnames = f;
        analyzer = a;
    }

    public void setFieldNames (String[] f) {
        fieldnames = f;
    }

    protected Query getFieldQuery (String field, Analyzer a, String queryText)
            throws ParseException {
        Query x = super.getFieldQuery(field, a, queryText);
        if (field == DEFAULT_FIELD && (fieldnames != null)) {
            BooleanQuery q2 = new BooleanQuery();
            if (x instanceof PhraseQuery) {
                Term[] terms = ((PhraseQuery)x).getTerms();
                for (int i = 0; i < fieldnames.length; i++) {
                    PhraseQuery q3 = new PhraseQuery();
                    q3.setSlop(((PhraseQuery)x).getSlop());
                    for (int j = 0; j < terms.length; j++) {
                        q3.add(new Term(fieldnames[i], terms[j].text()));
                    }
                    q2.add(q3, false, false);
                }
            } else if (x instanceof TermQuery) {
                String text = ((TermQuery)x).getTerm().text();
                for (int i = 0; i < fieldnames.length; i++) {
                    q2.add(new TermQuery(new Term(fieldnames[i], text)), false, false);
                }
            }
            return q2;
        }
        return x;
    }
}

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Backup strategies
Hi, I'm curious about your strategy for backing up indexes based on FSDirectory. If I do a file-based copy I suspect I will get corrupted data because of concurrent write access. My current favorite is to create an empty index and use IndexWriter.addIndexes() to copy the current index state. But I'm not sure about the performance of this solution. How do you make your backups? Regards, Christoph
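For what it's worth, the addIndexes() variant can be sketched like this. This is only a sketch: the classes and method names are the stock Lucene 1.4 API, but the assumption that the source index is quiet (updates paused) during the copy is mine — addIndexes() only write-locks the target index, not the source.

```java
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;

public class IndexBackup {
    // Copy the current state of src into dest by merging it into a fresh index.
    // Note: only dest is write-locked here; the source index should be quiet
    // (or updates paused) while this runs.
    public static void backup(Directory src, Directory dest) throws IOException {
        // true = create a new, empty index at dest, replacing anything there
        IndexWriter writer = new IndexWriter(dest, new StandardAnalyzer(), true);
        try {
            writer.addIndexes(new Directory[] { src });
        } finally {
            writer.close();
        }
    }
}
```

The merge cost is roughly that of an optimize over the whole index, which is why the performance question is a fair one for large indexes.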
Re: Backup strategies
Christoph Kiehl wrote: I'm curious about your strategy to backup indexes based on FSDirectory. If I do a file based copy I suspect I will get corrupted data because of concurrent write access. My current favorite is to create an empty index and use IndexWriter.addIndexes() to copy the current index state. But I'm not sure about the performance of this solution. I have no practical experience with backing up an online index, but I would try to find out the details of the write-lock mechanism Lucene uses at the file level. You could then create a backup component that write-locks the index and does a regular file copy of the index dir. During the backup, searches can continue while updates are temporarily blocked. But as I said, I'm only speculating... Chris
Re: Backup strategies
Christiaan Fluit wrote: I have no practical experience with backing up an online index, but I would try to find out the details of the write lock mechanism used by Lucene at the file level. You can then create a backup component that write-locks the index and does a regular file copy of the index dir. During backup time searches can continue while updates will be temporarily blocked. The problem with this approach is that it will not only block write operations; those operations will time out, which leads to exceptions. To prevent this you must implement some queuing, which is what I would like to avoid. Regards, Christoph
Boost value
Hello, I am working with Lucene and tried to understand the calculation of the score value. As far as I understand it works as follows:

(1) idf = ln(numDocs/(docFreq+1))
(2) queryWeight = idf * boost
(3) sumOfSquaredWeights = queryWeight * queryWeight
(4) norm = 1/sqrt(sumOfSquaredWeights)
    Question 1: why not simply norm = 1/queryWeight?
(5) queryWeight' = queryWeight * norm
(6) weightValue = queryWeight' * idf
    Question 2: substituting (1)-(5) into (6) step by step gives weightValue = idf.

I did only pure algebraic substitutions and it all comes down to a simple formula. The boost value is not needed anymore. Where is my fault? Thanks, Michael
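As a sketch of the algebra (symbols follow the numbered steps above): for a *single-term* query the substitution is indeed exact and the boost cancels, but with more than one clause the sum of squared weights runs over all clauses, so each boost survives as a *relative* weight between clauses:

```latex
% single-term query: the boost cancels
w = \mathrm{idf}\cdot b,\qquad
n = \frac{1}{\sqrt{w^2}} = \frac{1}{\mathrm{idf}\cdot b},\qquad
\text{weightValue} = (w \cdot n)\cdot \mathrm{idf} = \mathrm{idf}

% multi-clause query: boosts survive as relative weights
n = \frac{1}{\sqrt{\sum_j (\mathrm{idf}_j\, b_j)^2}},\qquad
\text{weightValue}_i = \frac{\mathrm{idf}_i\, b_i}{\sqrt{\sum_j (\mathrm{idf}_j\, b_j)^2}}\cdot \mathrm{idf}_i
```

So the derivation is not wrong — boosting every clause of a one-term query by the same factor genuinely has no effect; boosts only change the balance between clauses.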
RE: Indexing process causes Tomcat to stop working
James, How do you kick off your reindex? Could it be a session timeout? cheers, Aad

Hello, I am a Java/Lucene/Tomcat newbie. I know that does not bode well as a start to a post, but I really am in dire straits as far as Lucene goes, so bear with me. I am working on indexing and replacing search functionality for a website (about 10 gig in size, although only about 7 gig is indexed). I presently have a working model based on the luceneweb demo distributed with Lucene; this has already proven functional when tested on various sites (admittedly much smaller, 200-400mb etc). However, issues occur when performing the index on the main site that I haven't found explained on any of the Lucene forums thus far. After a successful index and optimisation of the website (takes around 4hrs 40m unoptimised) I can't get to the index.jsp or even access Tomcat. My first thought was to restart Tomcat. No joy and no access. Thinking the larger index had killed the test server, I accessed Apache on port 80, which worked perfectly. After a few checks I realised the test server was fine and Apache was fine, and I used the same application to create an index of the Tomcat docs, so Java was working. Confused, I went back to the forums, FAQs and groups to see if anyone had any similar problems, and have come up with a brief list of what my problem is not: There is no index write.lock file found for Lucene in either the /tmp or opt/tomcat/temp directories, so the index is open to be searched. Nor does 'top' reveal anything overloading the system. Apache is running fine and displays all relevant pages. Tomcat cannot be reached with a browser (neither the default congratulations page nor the Luceneweb application). Tomcat was a fresh install, as was Java, and the Tomcat logs show nothing different to standard startup logs. So I logged the entire indexing process and saw two errors occurring infrequently. Parse Aborted: Encountered \ at line 6, column 129. //where these values vary Was expecting one of: ArgName ... = ...
TagEnd ... I'm satisfied this is just the HTML parser kicking off about some badly formatted HTML and is only affecting what is indexed, but it's here for completeness. The other error is more serious:

java.io.IOException: Pipe closed
    at java.io.PipedInputStream.receive(PipedInputStream.java:136)
    at java.io.PipedInputStream.receive(PipedInputStream.java:176)
    at java.io.PipedOutputStream.write(PipedOutputStream.java:129)
    at sun.nio.cs.StreamEncoder$CharsetSE.writeBytes(StreamEncoder.java:336)
    at sun.nio.cs.StreamEncoder$CharsetSE.implWrite(StreamEncoder.java:395)
    at sun.nio.cs.StreamEncoder.write(StreamEncoder.java:136)
    at sun.nio.cs.StreamEncoder.write(StreamEncoder.java:146)
    at java.io.OutputStreamWriter.write(OutputStreamWriter.java:204)
    at java.io.Writer.write(Writer.java:126)
    at org.apache.lucene.demo.html.HTMLParser.addText(HTMLParser.java:137)
    at org.apache.lucene.demo.html.HTMLParser.HTMLDocument(HTMLParser.java:203)
    at org.apache.lucene.demo.html.ParserThread.run(ParserThread.java:31)

I'm again pretty sure that this is the same error that occurred once before when I was using maxFieldLength to limit the number of terms recorded. I'm also confident it's a threading error and found the following post by Doug Cutting that seemed to explain it (http://java2.5341.com/msg/80502.html); however, I am assuming that's what it is and haven't yet attempted to change the threading system of the demo, due to my lack of Java knowledge. The strange thing is that after restarting the server all aspects of the Lucene web application work perfectly - stemming, alphanumeric indexing, summaries etc are all as expected - so I am left assuming due to this (and by running out of options) that Lucene has somehow done something to Tomcat by doing such a large index.
Being that both run off Java I guess it's something to do with that, but I have nowhere near enough experience in Java to work out what. The system I am currently running on is Java 1.4.2_05, Tomcat 5.0.27, Lucene 1.4.1, Linux kernel 2.4.20-8 (gcc version 3.2.2 20030222 (Red Hat Linux 3.2.2-5)), Apache 2.0.42. I have not modified the mergeFactor or MaxMergeDocuments, nor am I using RAMdirectories. The processor is 800MHz and there is 128mb of RAM. If more info is required on setup, source code etc, or you think this should be moved to a Tomcat forum, just post. Best regards and thanks in advance for any advice you can offer, J Tyrrell
RE: Indexing process causes Tomcat to stop working
Aad, D'oh, forgot to mention that mildly important info. Rather than re-index I am just creating a new index each time; this makes things easier to roll back etc (which is what my boss wants). The command line is something like java com.lucene.IndexHTML -create -index indexstore/ .. I have wondered about whether sessions could be a problem, but I don't think so; otherwise wouldn't a restart of Tomcat be sufficient rather than a reboot? I even tried the killall command on java and tomcat, then started everything again, to no avail. cheers, JT
RE: Indexing process causes Tomcat to stop working
So, are you creating the indexes from inside the Tomcat runtime, or are you creating them on the command line (which would be in a different runtime than Tomcat)? What happens to Tomcat? Does it hang - still running but not responsive? Or does it crash? If it hangs, maybe you are running out of memory. By default, Tomcat's limit is set pretty low... There is no reason at all you should have to reboot... If you stop and start Tomcat (make sure it actually stopped - sometimes it requires a kill -9 when it really gets hung) it should start working again. Depending on your setup of Tomcat + Apache, you may have to restart Apache as well to get them linked to each other again... Dan
IndexWriter Constructor question
Wouldn't it make more sense if the constructor for the IndexWriter always created an index if it doesn't exist, and the boolean parameter were clear (instead of create)? So instead of this (from the javadoc):

public IndexWriter(Directory d, Analyzer a, boolean create) throws IOException
Constructs an IndexWriter for the index in d. Text will be analyzed with a. If create is true, then a new, empty index will be created in d, replacing the index already there, if any.
Parameters:
d - the index directory
a - the analyzer to use
create - true to create the index or overwrite the existing one; false to append to the existing index
Throws: IOException - if the directory cannot be read/written to, or if it does not exist and create is false

We would have this:

public IndexWriter(Directory d, Analyzer a, boolean clear) throws IOException
Constructs an IndexWriter for the index in d. Text will be analyzed with a. If clear is true, and an index exists at location d, then it will be erased, and a new, empty index will be created in d.
Parameters:
d - the index directory
a - the analyzer to use
clear - true to overwrite the existing one; false to append to the existing index
Throws: IOException - if the directory cannot be read/written to, or if it does not exist.

Its current behavior is kind of annoying, because I have an app that should never clear an existing index; it should always append. So I want create set to false. But when I am starting a brand new index, I have to change the create flag to keep it from throwing an exception... I guess for now I will have to write code to check if an index actually has content yet, and if it doesn't, change the flag on the fly.
Re: IndexWriter Constructor question
You could always modify your own local copy if you want to change the behavior of the parameter, or just do:

IndexWriter w = new IndexWriter(indexDirectory, new StandardAnalyzer(), !IndexReader.indexExists(indexDirectory));

If you do that, then if an index exists it will not be created; otherwise it will be...
Poor Lucene Ranking for Short Text
http://www.peerfear.org/rss/permalink/2004/10/26/PoorLuceneRankingForShortText/ -- Use Rojo (RSS/Atom aggregator). Visit http://rojo.com. Ask me for an invite! Also see irc.freenode.net #rojo if you want to chat. Rojo is Hiring! - http://www.rojonetworks.com/JobsAtRojo.html If you're interested in RSS, Weblogs, Social Networking, etc... then you should work for Rojo! If you recommend someone and we hire them you'll get a free iPod! Kevin A. Burton, Location - San Francisco, CA AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
Re: Poor Lucene Ranking for Short Text
On Wednesday 27 October 2004 20:20, Kevin A. Burton wrote: http://www.peerfear.org/rss/permalink/2004/10/26/PoorLuceneRankingForShortText/ (Kevin complains about shorter documents being ranked higher) This is something that can easily be fixed. Just use a Similarity implementation that extends DefaultSimilarity and overrides lengthNorm: just return 1.0f there. You need to use that Similarity for both indexing and searching, i.e. it requires reindexing. Regards Daniel -- http://www.danielnaber.de
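Daniel's suggestion, sketched out (DefaultSimilarity and the lengthNorm signature are the Lucene 1.4 API; the class name is mine):

```java
import org.apache.lucene.search.DefaultSimilarity;

// Disables length normalization: every document gets the same length factor,
// so short documents lose their scoring advantage over long ones.
public class FlatLengthSimilarity extends DefaultSimilarity {
    public float lengthNorm(String fieldName, int numTokens) {
        return 1.0f;
    }
}
```

Install it on both sides - writer.setSimilarity(new FlatLengthSimilarity()) before (re)indexing and searcher.setSimilarity(...) before searching - since, as Daniel notes, the length norm is baked into the index at write time.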
Stopwords in Exact phrase
Is there a way to include stopwords in an exact-phrase search? For example, when I search on Melbourne IT, Lucene only searches for Melbourne, ignoring IT. Thanks, Ravi.
Re: Stopwords in Exact phrase
On Oct 27, 2004, at 3:36 PM, Ravi wrote: Is there a way to include stopwords in an exact phrase search? For example, when I search on Melbourne IT, Lucene only searches for Melbourne, ignoring IT. But you want stop words removed for general term queries? Have a look at how Nutch does its thing - it has a very similar type of situation where it deals with common terms differently if they are in a phrase. There are other choices - use a different analyzer, and if you want that used only for phrase queries you can override QueryParser and its getFieldQuery method. Erik
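Erik's QueryParser suggestion could look roughly like this. A sketch only: the class name and the crude space test for "is this a phrase" are mine, and it only helps if the index itself was built without stopword removal - otherwise the stopped terms simply aren't in the index to match.

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

// Routes phrase text through a stopword-free analyzer while single terms
// keep the normal (stopping) analyzer.
public class PhraseKeepingQueryParser extends QueryParser {
    // SimpleAnalyzer tokenizes and lowercases but applies no stop filter
    private final Analyzer phraseAnalyzer = new SimpleAnalyzer();

    public PhraseKeepingQueryParser(String field, Analyzer a) {
        super(field, a);
    }

    protected Query getFieldQuery(String field, Analyzer a, String queryText)
            throws ParseException {
        if (queryText.indexOf(' ') >= 0) {        // crude "is a phrase" check
            return super.getFieldQuery(field, phraseAnalyzer, queryText);
        }
        return super.getFieldQuery(field, a, queryText);
    }
}
```

With this, a query like "Melbourne IT" becomes a two-term PhraseQuery including "it", while plain term queries are still stopped as before.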
Re: Stopwords in Exact phrase
your analyzer will have removed the stopword when you indexed your documents, so lucene won't be able to do this for you. You will need to implement a second pass over the results returned by lucene and check to see if the stopword is included, perhaps with String.indexOf()
Highlighter problem: null as result
Hello, I'm trying to use the Highlighter from the sandbox and I've got a problem with some of the results coming back from it. Normally when I search my index for e.g. motor I get circa 150 results -- these results are OK. But when I use the Highlighter I get null values for the content field in some results. Is this a bug in the Highlighter class? greetings, jose
Re: Poor Lucene Ranking for Short Text
Daniel Naber wrote: (Kevin complains about shorter documents ranked higher) This is something that can easily be fixed. Just use a Similarity implementation that extends DefaultSimilarity and overrides lengthNorm: just return 1.0f there. You need to use that Similarity for indexing and searching, i.e. it requires reindexing. What happens when I do this with an existing index? I don't want to have to rewrite this index as it will take FOREVER. If the current behavior is all that happens this is fine... this way I can just get this behavior for new documents that are added. Also... why isn't this the default? Kevin
Re: new version of NewMultiFieldQueryParser
> I'm not sure this solution is very robust
Thanks, but I'm pretty sure it *is* robust. Can you please offer a specific critique? Always happy to learn and improve :-).
> I think I already sent an email with a better code...
Pretty vague. Can you send a URL for that message in the archive? Bill
Re: Looking for consulting help on project
Suggestions: [a] Try invoking the VM w/ an option like -XX:CompileThreshold=100 or even a smaller number. This encourages the HotSpot VM to compile methods sooner, so the app will take less time to warm up. http://java.sun.com/docs/hotspot/VMOptions.html#additional You might want to search the web for refs to this, especially how apps like Eclipse are brought up, as I think their invocation script sets other obscure options to guide GC too. [b] Any time I've worked w/ a hard-core Java server I've always found it helpful to have a loop explicitly trying to force gc - this is the idiom I use (i.e. you may have to do more than just System.gc()), and my suggestion is to try calling this every 15-60 secs so that memory use never jumps. I know that in theory you should never need to, but it may help.

public static long gc() {
    long bef = mem();
    System.gc();
    sleep(100);
    System.runFinalization();
    sleep(100);
    System.gc();
    long aft = mem();
    return aft - bef;
}

Gordon Riggs wrote: Hi, I am working on a web development project using PHP and mySQL. The team has implemented full text search with mySQL, but is now researching Lucene to help with performance/scalability issues. The team is looking for a developer who has experience working with Lucene and can assist with integrating it into our environment. What follows is a brief overview of the problems that we're working to address. If you have experience with using Lucene with large amounts of data (we have roughly 16 million records) where search time is critical (needs to be under .2 seconds), then please respond. Thanks, Gordon Riggs [EMAIL PROTECTED] 1. Loading the index into memory using Lucene's RAMDirectory Why is the Java heap 2.9GB for a 1.4GB index? Why can we not load an index over 1.4GB in size? We receive 'java.lang.OutOfMemoryError' even with the -mx flag set as high as '10g'. We're using a dedicated test machine which has dual AMD Opteron processors and 12GB of memory. The OS is SuSE Linux Enterprise Server 9 (x86_64).
The java version is: Java(TM) 2 Runtime Environment, Standard Edition (build Blackdown-1.4.2) Java HotSpot(TM) 64-Bit Server VM (build Blackdown-1.4.2-fcs, mixed mode) We also get similar results with: Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.2_03-b02) Java HotSpot(TM) Client VM (build 1.4.2_03-b02, mixed mode)

2. How to keep Lucene and Java in memory, to improve performance The idea is to have a Lucene daemon that loads the index into memory once on startup. It then listens for connections and performs search requests for clients using that single index instance. Do you foresee any problems (other than the ones stated above) with this approach? Garbage collection and/or memory leaks? Performance issues? Concurrency issues with multiple searches coming in at once? What's involved in writing the daemon? Assuming that we need the daemon, we need to find out how big a job it is to develop, what requirements need to be specified, etc.

3. How to interface our PHP web application with Java Our web application is written in PHP so we need a communication interface for performing search queries that is both PHP and Java friendly. What do you think would be a good solution? XML-RPC? What's involved in developing the solution?

4. How to tune Lucene Are there ways to tune Lucene in order to improve performance? We already plan on moving the index into memory. What else can be done to improve the search times? Can the way the index is built affect performance?
weights on multi index searches
Can I give weights to different indexes when I search against multiple indexes? The final score of a document should be a linear combination of the weight on each index and the individual score for that index. Is this possible in Lucene? Thanks Ravi.
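I don't know of a built-in per-index boost, but one could sketch the linear combination by hand. This is only a sketch under stated assumptions: every document stores a unique "docID" field (the field name is mine), and adding raw Lucene scores across indexes is acceptable for your ranking - raw scores are not calibrated between indexes, so this is a heuristic, not a principled merge.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;

// Runs the same query against each index and accumulates weight * score
// per document key.
public class WeightedMultiSearch {
    public static Map search(Searcher[] searchers, float[] weights, Query q)
            throws IOException {
        Map combined = new HashMap(); // docID -> Float(weighted score sum)
        for (int i = 0; i < searchers.length; i++) {
            Hits hits = searchers[i].search(q);
            for (int j = 0; j < hits.length(); j++) {
                String key = hits.doc(j).get("docID");
                Float prev = (Float) combined.get(key);
                float score = weights[i] * hits.score(j);
                combined.put(key, new Float(prev == null ? score : prev.floatValue() + score));
            }
        }
        return combined;
    }
}
```

The map can then be sorted by value to produce the final ranking.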
Locks and Readers and Writers
Hi, I'm getting: java.io.IOException: Lock obtain timed out I have a writer service that opens the index to delete and add docs, and a reader service that opens the index for searching only. This error occurs when the reader service opens the index (this takes about 500ms) and the writer service tries to open it a couple of milliseconds later. The reader service hasn't fully opened the index yet and this exception gets thrown. What are my options? Should I just set the timeout to a higher value? Thanks.
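If raising the timeout is acceptable, it can be sketched like this (the field names are from the Lucene 1.4 IndexWriter; whether a longer wait is the right fix versus queuing writes is a design question, and the 10-second value is an arbitrary example, not a recommendation):

```java
import org.apache.lucene.index.IndexWriter;

// The lock timeouts in Lucene 1.4 are public static fields (milliseconds)
// that apply process-wide; raising them before opening any readers/writers
// gives a slow-opening reader time to release its lock.
public class LockConfig {
    public static void relaxLockTimeouts() {
        IndexWriter.WRITE_LOCK_TIMEOUT = 10 * 1000;
        IndexWriter.COMMIT_LOCK_TIMEOUT = 10 * 1000;
    }
}
```

Call LockConfig.relaxLockTimeouts() once at startup, before either service opens the index.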
Re: Poor Lucene Ranking for Short Text
On Wednesday 27 October 2004 22:47, Kevin A. Burton wrote: If the current behavior is all that happens this is fine... this way I can just get this behavior for new documents that are added. You'll have to try it out, I'm not sure what exactly will happen. Also... why isn't this the default? You'll probably end up with many documents having exactly the same ranking. And those documents will then be sorted in a random order (not really, they will be sorted by internal ID I think, but that's no useful order for most use cases). Regards Daniel -- http://www.danielnaber.de
document ID and performance
Hello I wrote the following test programs: I index 150,000 documents in Lucene and I build each document using this method.

private Document buildDocument(String documentID, String body) {
    Document document = new Document();
    document.add(Field.Keyword("docID", documentID));
    document.add(Field.UnStored("body", body));
    return document;
}

I then run a search using the following method:

int search(String word) throws IOException {
    IndexSearcher searcher = new IndexSearcher(_indexDirectory);
    try {
        Query q = new TermQuery(new Term("body", word));
        Hits hits = searcher.search(q);
        return hits.length();
    } finally {
        searcher.close();
    }
}

when I run this method on the word 'software' I get about 20,000 results and it takes an average of 22ms per search, which is very good. If I run the following method:

List search2(String word) throws IOException {
    IndexSearcher searcher = new IndexSearcher(_indexDirectory);
    try {
        Query q = new TermQuery(new Term("body", word));
        Hits hits = searcher.search(q);
        ArrayList res = new ArrayList(hits.length());
        for (int i = 0; i < hits.length(); i++) {
            res.add(hits.doc(i).get("docID"));
        }
        return res;
    } finally {
        searcher.close();
    }
}

I get of course the same number of results but the performance really drops: I get a time which varies from 300ms to 700ms per query and it is not consistent... it varies a lot from one run to the other. If I run this other method:

List search3(String word) throws IOException {
    IndexSearcher searcher = new IndexSearcher(_indexDirectory);
    try {
        Query q = new TermQuery(new Term("body", word));
        MyHitCollector collector = new MyHitCollector();
        searcher.search(q, collector);
        return collector.getDocumentIDs();
    } finally {
        searcher.close();
    }
}

with

public class MyHitCollector extends HitCollector {
    ArrayList res = new ArrayList();
    public void collect(int i, float v) {
        res.add(String.valueOf(i));
    }
    public List getDocumentIDs() {
        return res;
    }
}

I get the same kind of results I was getting the first time: about 22ms to run the query.
This clearly shows that the action of searching the documents is extremely fast, and that it is the action of actually accessing the documents (hits.doc(i)...) which makes the performance drop. I know that there is no relationship between the document id returned in the collect method and the document id I store myself in the docID field, but technically that is the only thing I care about: I want to run a very fast search that simply returns the matching document id. Is there any way to associate the document id returned in the hit collector with the document ID I stored in the index? Anybody have any idea how to do that? Ideally you would want to be able to write something like this: document.add(Field.ID(documentID)); and then in the HitCollector API: collect(String documentID, float score) with the documentID being the one you stored (but which would be returned very efficiently). Thanks for your help Yan Pujante
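One common way to get this (a sketch, not the only approach): pay the stored-field cost once by materializing the "docID" keyword field into an array indexed by Lucene's internal doc number, then resolve ids in the HitCollector via a plain array lookup. Assumptions: docID is indexed via Field.Keyword (exactly one term per document, as in the buildDocument above), and the array is rebuilt whenever the reader is reopened.

```java
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;

public class DocIdCache {
    // Walk every term of the "docID" field and record, for each internal doc
    // number, which docID term it carries.
    public static String[] load(IndexReader reader) throws IOException {
        String[] ids = new String[reader.maxDoc()];
        TermEnum terms = reader.terms(new Term("docID", ""));
        TermDocs docs = reader.termDocs();
        try {
            while (terms.term() != null && "docID".equals(terms.term().field())) {
                docs.seek(terms);
                while (docs.next()) {
                    ids[docs.doc()] = terms.term().text();
                }
                if (!terms.next()) break;
            }
        } finally {
            docs.close();
            terms.close();
        }
        return ids;
    }
}
```

A HitCollector can then do res.add(ids[i]) in collect(), which keeps the ~22ms search time since no stored fields are read per hit.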
Documents with 1 word are given unfair lengthNorm()
WRT my blog post: It seems the problem is that the distribution for lengthNorm() starts at 1 and moves down from there. Returning 1.0f would work, but then HUGE documents wouldn't be normalized at all and would distort the results. What would you think of using this implementation for lengthNorm:

public float lengthNorm( String fieldName, int numTokens ) {
    int THRESHOLD = 50;
    int nt = numTokens;
    if ( numTokens <= THRESHOLD )
        ++nt;
    if ( numTokens > THRESHOLD )
        nt -= THRESHOLD;
    float v = (float)(1.0 / Math.sqrt(nt));
    if ( numTokens <= THRESHOLD )
        v = 1 - v;
    return v;
}

This starts the distribution low, approaches 1.0 when 50 terms are in the document, then asymptotically moves toward zero from there on out based on sqrt. For example, values from 1 - 150 would yield (I'd graph this out but I'm too lazy): 1 - 0.29289323 2 - 0.42264974 3 - 0.5 4 - 0.5527864 5 - 0.5917517 6 - 0.6220355 7 - 0.6464466 8 - 0.6666667 9 - 0.6837722 10 - 0.69848865 11 - 0.7113249 12 - 0.72264993 13 - 0.73273873 14 - 0.74180114 15 - 0.75 16 - 0.7574644 17 - 0.7642977 18 - 0.7705843 19 - 0.7763932 20 - 0.7817821 21 - 0.7867993 22 - 0.7914856 23 - 0.79587585 24 - 0.8 25 - 0.80388385 26 - 0.8075499 27 - 0.81101775 28 - 0.81430465 29 - 0.81742585 30 - 0.8203947 31 - 0.8232233 32 - 0.82592237 33 - 0.8285014 34 - 0.83096915 35 - 0.8333333 36 - 0.83560103 37 - 0.83777857 38 - 0.8398719 39 - 0.8418861 40 - 0.84382623 41 - 0.8456966 42 - 0.8475014 43 - 0.84924436 44 - 0.8509288 45 - 0.852558 46 - 0.85413504 47 - 0.85566247 48 - 0.85714287 49 - 0.8585786 50 - 0.859972 51 - 1.0 52 - 0.70710677 53 - 0.57735026 54 - 0.5 55 - 0.4472136 56 - 0.4082483 57 - 0.37796447 58 - 0.35355338 59 - 0.33333334 60 - 0.31622776 61 - 0.30151135 62 - 0.28867513 63 - 0.2773501 64 - 0.26726124 65 - 0.2581989 66 - 0.25 67 - 0.24253562 68 - 0.23570226 69 - 0.22941573 70 - 0.2236068 71 - 0.2182179 72 - 0.21320072 73 - 0.2085144 74 - 0.20412415 75 - 0.2 76 - 0.19611613 77 - 0.19245009 78 - 0.18898223 79 - 0.18569534 80 - 0.18257418 81 - 0.1796053 82 - 0.17677669
83 - 0.17407766 84 - 0.17149858 85 - 0.16903085 86 - 0.16666667 87 - 0.16439898 88 - 0.16222142 89 - 0.16012815 90 - 0.15811388 91 - 0.15617377 92 - 0.15430336 93 - 0.15249857 94 - 0.15075567 95 - 0.1490712 96 - 0.14744195 97 - 0.145865 98 - 0.14433756 99 - 0.14285715 100 - 0.14142136 101 - 0.14002801 102 - 0.13867505 103 - 0.13736056 104 - 0.13608277 105 - 0.13483997 106 - 0.13363062 107 - 0.13245323 108 - 0.13130644 109 - 0.13018891 110 - 0.12909944 111 - 0.12803689 112 - 0.12700012 113 - 0.12598816 114 - 0.125 115 - 0.12403473 116 - 0.12309149 117 - 0.12216944 118 - 0.12126781 119 - 0.120385855 120 - 0.11952286 121 - 0.11867817 122 - 0.11785113 123 - 0.11704115 124 - 0.11624764 125 - 0.11547005 126 - 0.114707865 127 - 0.11396058 128 - 0.1132277 129 - 0.11250879 130 - 0.1118034 131 - 0.11111111 132 - 0.11043153 133 - 0.10976426 134 - 0.10910895 135 - 0.10846523 136 - 0.107832775 137 - 0.107211255 138 - 0.10660036 139 - 0.10599979 140 - 0.10540926 141 - 0.104828484 142 - 0.1042572 143 - 0.10369517 144 - 0.10314213 145 - 0.10259783 146 - 0.10206208 147 - 0.10153462 148 - 0.101015255 149 - 0.10050378
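For reference, the proposed function in self-contained form (a sketch: the comparison operators were mangled in transit and are reconstructed here so that the output reproduces the table above):

```java
// Proposed lengthNorm with a threshold of 50: values below the threshold
// ramp up toward 1.0 as 1 - 1/sqrt(n+1); values above it decay as
// 1/sqrt(n - threshold).
public class ProposedLengthNorm {
    static final int THRESHOLD = 50;

    public static float lengthNorm(int numTokens) {
        int nt = numTokens;
        if (numTokens <= THRESHOLD)
            ++nt;
        else
            nt -= THRESHOLD;
        float v = (float) (1.0 / Math.sqrt(nt));
        if (numTokens <= THRESHOLD)
            v = 1 - v;
        return v;
    }
}
```

Note the discontinuity around the threshold (50 → ~0.86, 51 → 1.0, 52 → ~0.71), which is one obvious objection to this shape of curve.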