Re: some thoughts about adding transactions.
I didn't want to let this drop on the floor, but I haven't had the time to craft a response to it either. So, just for the record, I agree that transactions would be nice. I think it is important that the solution address change visibility and concurrent transactions within multiple VMs. Also, it should be backward compatible so that applications can run without transactions. So, I think that a good solution is probably more complex than it initially looks... S

On Jan 8, 2005, at 6:47 AM, Peter Veentjer - Anchor Men wrote:

I have a question about transactions. Lucene doesn't support transactions, but I find them very important, and I think it is possible to add some kind of rollback/commit functionality to make sure the index doesn't corrupt. With Lucene every segment is immutable (this is a perfect starting point), so after it has been created it will remain forever in a valid state. There are three ways to alter the index:

1) deleting documents
2) adding documents
3) optimization

If I delete a document, a .del file appears (but doesn't alter the segment, because it is immutable).
- if crash: the .del files could be deleted to do a rollback.
- if success: the .del files finally will be used by the writer to skip those documents in the new segment.

If a new document is added, a new segment is (eventually) created.
- if success: the new segment is created and the old segments can be deleted.
- if crash: the new segment (maybe it's corrupted) can be deleted to do a rollback.

If the index is optimized, a new segment is created based on older segments.
- if success: the old segments can be deleted.
- if crash: the new segment (maybe it's corrupted) can be deleted to do a rollback.

With this information, it wouldn't be too much trouble to add some kind of rollback/transaction functionality? And how about those 'per index' files? Can these be corrupted? Can these be removed and recreated successfully?
Would it be an idea to make copies of these files and restore them if the transaction is rolled back? - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
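Peter's staging idea above can be sketched with plain file operations. This is a hypothetical illustration, not Lucene code: because segments are immutable once written, a "transaction" can stage its new files under temporary names and either promote them all (commit) or delete them all (rollback). The class and method names (SegmentTxn, stageSegment) are invented for the sketch.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.*;
import java.util.*;

// Hypothetical sketch of segment-level commit/rollback over immutable files.
public class SegmentTxn {
    private final Path indexDir;
    private final List<Path> staged = new ArrayList<>();

    public SegmentTxn(Path indexDir) { this.indexDir = indexDir; }

    // Stage a new segment under a temporary name; nothing is visible yet.
    public void stageSegment(String name, byte[] data) {
        try {
            Path tmp = indexDir.resolve(name + ".tmp");
            Files.write(tmp, data);
            staged.add(tmp);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Commit: rename every staged file into its final, visible name.
    public void commit() {
        try {
            for (Path tmp : staged) {
                String finalName = tmp.getFileName().toString().replace(".tmp", "");
                Files.move(tmp, indexDir.resolve(finalName),
                        StandardCopyOption.REPLACE_EXISTING);
            }
            staged.clear();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Rollback (or crash recovery): delete whatever is still staged.
    public void rollback() {
        try {
            for (Path tmp : staged) Files.deleteIfExists(tmp);
            staged.clear();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The same pattern also answers the 'per index' file question: rather than copying and restoring those files, the updated copy can be written under a temporary name and renamed into place only at commit.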
Re: dotLucene (port of Jakarta Lucene to C#)
Why does it seem to you that C# is faster than Java? In any case, generally the bottleneck isn't the VM. It's the I/O to the disks... Scott

The reasonable man adapts himself to the world; the unreasonable one persists in trying to adapt the world to himself. Therefore all progress depends on the unreasonable man. - George Bernard Shaw

On Dec 1, 2004, at 5:42 AM, Nicolas Maisonneuve wrote:

Hi George, is the C# Lucene faster than the Java Lucene? (Because it seems to me that C# is faster than Java, isn't it?) Nicolas Maisonneuve

On Sun, 28 Nov 2004 21:08:30 -0500, George Aroush [EMAIL PROTECTED] wrote:

Hi folks, I am pleased to announce the availability of dotLucene 1.4.0 RC1. dotLucene is a complete port of Jakarta Lucene to C#. The port is almost a line-by-line port and it includes the demos as well as all the JUnit tests. An index created by dotLucene is cross-compatible with Jakarta Lucene and vice versa. Please visit http://sourceforge.net/projects/dotlucene/ to learn more about dotLucene and to download the source code. Best regards, -- George Aroush
Re: BooleanQuery - Too Many Clauses on date range.
You can use: BooleanQuery.setMaxClauseCount(int maxClauseCount); to increase the limit.

On Sep 30, 2004, at 8:24 PM, Chris Fraschetti wrote:

I recently read, in regards to my problem, that date_field:[0820483200 TO 110448] is evaluated into a series of boolean queries ... which has a cap of 1024 ... considering my documents will have dates spanning over many years, and I need the granularity of 'by day' searching, are there any recommendations on how to make this work? Currently with query: +content_field:sometext +date_field:[0820483200 TO 110448] I get the following exception: org.apache.lucene.search.BooleanQuery$TooManyClauses Any suggestions on how I can still keep the granularity of by day, but without limiting my search results? Are there any date formats that I can change those numbers to that would allow me to complete the search (i.e. Feb, 15 2004)... can Lucene's range do a proper search on formatted dates? Is there a combination of RangeQuery and Query/MultiTermQuery that I can use? Your help is greatly appreciated. -- ___ Chris Fraschetti e [EMAIL PROTECTED]
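As a back-of-the-envelope check of why the 1024 cap is hit: the range query expands into one boolean clause per distinct indexed term in the range, so day-granularity dates need (roughly) one clause per day. Assuming one term per day, a range spanning several years blows past the default limit, as this sketch shows:

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

// Illustrative arithmetic only: one clause per indexed day in the range.
public class ClauseCount {
    public static long clausesFor(LocalDate from, LocalDate to) {
        // inclusive day count = number of terms the range expands to
        return ChronoUnit.DAYS.between(from, to) + 1;
    }

    public static void main(String[] args) {
        long n = clausesFor(LocalDate.of(1996, 1, 1), LocalDate.of(2004, 12, 31));
        System.out.println(n + " clauses needed"); // well over the 1024 default
    }
}
```

So either the limit has to be raised as above, or the date granularity coarsened (e.g. index an extra month-granularity field and combine a month range with a few day terms at the edges).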
Re: Open-ended range queries
At one point it definitely supported null for either term. I think that has been removed/forgotten in the later revisions of the QueryParser... Scott On Jun 10, 2004, at 1:24 PM, Erik Hatcher wrote: On Jun 10, 2004, at 2:13 PM, Terry Steichen wrote: Actually, QueryParser does support open-ended ranges like : [term TO null]. Doesn't work for the lower end of the range (though that's usually less of a problem). It supports null? Are you sure? If so, I'm very confused about it because I don't see where in the grammar it has any special handling like that. Could you show an example that demonstrates this? Erik
Re: Open-ended range queries
It looks to me like Revision 1.18 broke it. On Jun 10, 2004, at 3:26 PM, Erik Hatcher wrote: On Jun 10, 2004, at 4:07 PM, Terry Steichen wrote: Well, I'm using 1.4 RC3 and the null range upper limit works just fine for searches in two of my fields; one is in the form of a canonical date (e.g., 20040610) and the other is in the form of a padded word count (e.g., 01500 for 1500). The syntax would be pub_date:[20040501 TO null] (dates later than April 30, 2004) and s_words:[01000 TO null] (articles with 1000 or more words). Ah. It works for you because you have numeric values and lexically null is greater than any of them. It is still using it as a lexical term value, and not truly making the end open-ended. This is why null doesn't work at the beginning for you either. It's just being treated as text, just like your numbers are. PS: This use of null has worked this way since at least 1.2. As I recall, way back when, null also worked as the first term limit (but no longer does). If so, then something serious broke. I've not the time to check the cvs logs on this, but I cannot imagine that we removed something like this. If anyone cares to dig up the diff where we removed/broke this, I'd be gracious. Erik
Re: Open-ended range queries
Well, I do like the *, but apparently there are some people that are using this with the null... Scott On Jun 10, 2004, at 7:15 PM, Erik Hatcher wrote: On Jun 10, 2004, at 4:54 PM, Scott ganyo wrote: It looks to me like Revision 1.18 broke it. It seems this could be it: revision 1.18 date: 2002/06/25 00:05:31; author: briangoetz; state: Exp; lines: +62 -33 Support for new range query syntax. The delimiter is TO , but is optional for backward compatibility with previous syntax. If the range arguments match the format supported by DateFormat.getDateInstance(DateFormat.SHORT), then they will be converted into the appropriate date strings a la DateField. Added Field.Keyword constructor for Date-valued arguments. Optimized DateField.timeToString function. But geez June 2002 and no one has complained since? Given that this is so outdated, I'm not sure what the right course of action is. There are lots more Lucene users now than there were then. Would adding NULL back be what folks want? What about simply an asterisk to denote open ended-ness? [* TO term] or [term TO *] For completeness, here is the diff: % cvs diff -u -r 1.17 -r 1.18 QueryParser.jj Index: QueryParser.jj === RCS file: /home/cvs/jakarta-lucene/src/java/org/apache/lucene/queryParser/ QueryParser.jj,v retrieving revision 1.17 retrieving revision 1.18 diff -u -r1.17 -r1.18 --- QueryParser.jj 20 May 2002 15:45:43 - 1.17 +++ QueryParser.jj 25 Jun 2002 00:05:31 - 1.18 @@ -65,8 +65,11 @@ import java.util.Vector; import java.io.*; +import java.text.*; +import java.util.*; import org.apache.lucene.index.Term; import org.apache.lucene.analysis.*; +import org.apache.lucene.document.*; import org.apache.lucene.search.*; /** @@ -218,35 +221,30 @@ private Query getRangeQuery(String field, Analyzer analyzer, - String queryText, + String part1, + String part2, boolean inclusive) { -// Use the analyzer to get all the tokens. There should be 1 or 2. 
-TokenStream source = analyzer.tokenStream(field, - new StringReader(queryText)); -Term[] terms = new Term[2]; -org.apache.lucene.analysis.Token t; +boolean isDate = false, isNumber = false; -for (int i = 0; i 2; i++) -{ - try - { -t = source.next(); - } - catch (IOException e) - { -t = null; - } - if (t != null) - { -String text = t.termText(); -if (!text.equalsIgnoreCase(NULL)) -{ - terms[i] = new Term(field, text); -} - } +try { + DateFormat df = DateFormat.getDateInstance(DateFormat.SHORT); + df.setLenient(true); + Date d1 = df.parse(part1); + Date d2 = df.parse(part2); + part1 = DateField.dateToString(d1); + part2 = DateField.dateToString(d2); + isDate = true; } -return new RangeQuery(terms[0], terms[1], inclusive); +catch (Exception e) { } + +if (!isDate) { + // @@@ Add number support +} + +return new RangeQuery(new Term(field, part1), + new Term(field, part2), + inclusive); } public static void main(String[] args) throws Exception { @@ -282,7 +280,7 @@ | #_WHITESPACE: ( | \t ) } -DEFAULT SKIP : { +DEFAULT, RangeIn, RangeEx SKIP : { _WHITESPACE } @@ -303,14 +301,28 @@ | PREFIXTERM: _TERM_START_CHAR (_TERM_CHAR)* * | WILDTERM: _TERM_START_CHAR (_TERM_CHAR | ( [ *, ? ] ))* -| RANGEIN: [ ( ~[ ] ] )+ ] -| RANGEEX: { ( ~[ } ] )+ } +| RANGEIN_START: [ : RangeIn +| RANGEEX_START: { : RangeEx } Boost TOKEN : { NUMBER:(_NUM_CHAR)+ ( . (_NUM_CHAR)+ )? 
: DEFAULT } +RangeIn TOKEN : { +RANGEIN_TO: TO +| RANGEIN_END: ] : DEFAULT +| RANGEIN_QUOTED: \ (~[\])+ \ +| RANGEIN_GOOP: (~[ , ] ])+ +} + +RangeEx TOKEN : { +RANGEEX_TO: TO +| RANGEEX_END: } : DEFAULT +| RANGEEX_QUOTED: \ (~[\])+ \ +| RANGEEX_GOOP: (~[ , } ])+ +} + // * Query ::= ( Clause )* // * Clause ::= [+, -] [TERM :] ( TERM | ( Query ) ) @@ -387,7 +399,7 @@ Query Term(String field) : { - Token term, boost=null, slop=null; + Token term, boost=null, slop=null, goop1, goop2; boolean prefix = false; boolean wildcard = false; boolean fuzzy = false; @@ -415,12 +427,29 @@ else q = getFieldQuery(field, analyzer, term.image); } - | ( term=RANGEIN { rangein=true; } | term=RANGEEX ) + | ( RANGEIN_START ( goop1=RANGEIN_GOOP|goop1=RANGEIN_QUOTED ) + [ RANGEIN_TO ] ( goop2=RANGEIN_GOOP|goop2=RANGEIN_QUOTED ) + RANGEIN_END ) + [ CARAT boost=NUMBER ] +{ + if (goop1.kind == RANGEIN_QUOTED) +goop1.image = goop1.image.substring(1, goop1
Re: DocumentWriter, StopFilter should use HashMap... (patch)
I don't buy it. HashSet is but one implementation of a Set. By choosing the HashSet implementation you are not only tying the class to a hash-based implementation, you are tying the interface to *that specific* hash-based implementation or its subclasses. In the end, either you buy the concept of the interface and its abstraction or you don't. I firmly believe in using interfaces as they were intended to be used. Scott P.S. In fact, HashSet isn't always going to be the most efficient anyway. Just for one example: Consider possible implementations if I have only 1 or 2 entries. On Mar 10, 2004, at 11:13 PM, Erik Hatcher wrote: On Mar 10, 2004, at 10:28 PM, Doug Cutting wrote: Erik Hatcher wrote: Also... your HashSet constructor has to copy values from the original HashSet into the new HashSet ... not very clean and this can just be removed by forcing the caller to use a HashSet (which they should). I've caved in and gone HashSet all the way. Did you not see my message suggesting a way to both not expose HashSet publicly and also not to copy values? If not, I attached it. Yes, I saw it. But is there a reason not to just expose HashSet given that it is the data structure that is most efficient? I bought into Kevin's arguments that it made sense to just expose HashSet. As for copying values - that is only happening now if you use the Hashtable or String[] constructor. Erik Doug From: Doug Cutting [EMAIL PROTECTED] Date: March 10, 2004 1:08:24 PM EST To: Lucene Developers List [EMAIL PROTECTED] Subject: Re: cvs commit: jakarta-lucene/src/java/org/apache/lucene/analysis StopFilter.java Reply-To: Lucene Developers List [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:

- public StopFilter(TokenStream in, Set stopTable) {
+ public StopFilter(TokenStream in, Set stopWords) {
    super(in);
-   table = stopTable;
+   this.stopWords = new HashSet(stopWords);
  }

This always allocates a new HashSet, which, if the stop list is large, and documents are small, could impact performance.
Perhaps we can replace this with something like:

public StopFilter(TokenStream in, Set stopWords) {
    this(in, stopWords instanceof HashSet ? ((HashSet)stopWords) : new HashSet(stopWords));
}

and then add another constructor:

private StopFilter(TokenStream in, HashSet stopWords) {
    super(in);
    this.stopWords = stopWords;
}

Also, if we want the implementation to always be a HashSet internally, for performance, we ought to declare the field to be a HashSet, no? The competing goals here are: 1. Not to expose publicly the implementation of the Set; 2. Not to copy the contents of the Set when folks pass the value of makeStopSet. 3. Use the most efficient implementation internally. I think the changes above meet all of these. Doug
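Doug's suggestion above can be demonstrated with a self-contained stand-in class (this is not Lucene's actual StopFilter, just the constructor pattern): the public constructor accepts any Set, but avoids copying when the caller already passes a HashSet.

```java
import java.util.*;

// Stand-in for the StopFilter constructor pattern: Set in the public API,
// HashSet internally, with at most one copy.
public class StopSet {
    private final HashSet<String> stopWords;

    public StopSet(Set<String> stopWords) {
        // reuse the caller's HashSet directly, otherwise copy once into one
        this(stopWords instanceof HashSet
                ? (HashSet<String>) stopWords
                : new HashSet<>(stopWords));
    }

    private StopSet(HashSet<String> stopWords) {
        this.stopWords = stopWords;
    }

    public boolean isStopWord(String token) {
        return stopWords.contains(token);
    }
}
```

This meets all three goals in Doug's list: the public signature only mentions Set, a HashSet argument is reused without copying, and lookups always go through the hash-based field.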
Re: Index advice...
I have. While document.add() itself doesn't increase over time, the merge does. Ways of partially overcoming this include increasing the mergeFactor (but this will increase the number of file handles used), or building blocks of the index in memory and then merging them to disk. This has been discussed before, so you should be able to find additional information on this fairly easily. Scott On Feb 10, 2004, at 7:55 AM, Otis Gospodnetic wrote: --- Leo Galambos [EMAIL PROTECTED] wrote: Otis Gospodnetic wrote: Without seeing more information/code, I can't tell which part of your system slows down with time, but I can tell you that Lucene's 'add' does not slow over time (i.e. as the index gets larger). Therefore, I would look elsewhere for causes of the slowdown. Otis, can you point me to some proof that the time of an insert operation does not depend on the index size, please? Amortized time of insert is O(log(docsIndexed/mergeFac)), I think. This would imply that Lucene gets slower as it adds more documents to the index. Have you observed this behaviour? I haven't. Thus I do not know how it could be O(1). ~ O(1) is what I have observed through experiments with indexing of several million documents. Otis AFAIK the issue with PDF files can be based on the PDF parser (I already encountered this with PDFbox). The easiest thing to do is add logging to suspicious portions of the code. That will narrow the scope of the code you need to analyze. Otis --- [EMAIL PROTECTED] wrote: Hey Lucene-users, I'm setting up a Lucene index on 5G of PDF files (full-text search). I've been really happy with Lucene so far but I'm curious what tips and strategies I can use to optimize my performance at this large size. So far I am using pretty much all of the defaults (I'm new to Lucene). I am using PDFBox to add the documents to the index.
I can usually add about 800 or so PDF files and then the add loop:

for ( int i = 0; i < fileNames.length; i++ ) {
    Document doc = IndexFile.index(baseDirectory + documentRoot + fileNames[i]);
    writer.addDocument(doc);
}

really starts to slow down. Doesn't seem to be memory related. Thoughts anyone? Thanks in advance, CK Hill
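The amortized-cost claim in the thread above can be made concrete with a small simulation (illustrative only, not a measurement of Lucene itself): with merge factor m, each document gets re-copied roughly once per merge "level", i.e. about log_m(N) times over the life of the index. Adds stay cheap, but the occasional cascading merge is where the time goes.

```java
import java.util.*;

// Simulate tiered merging: each doc starts as a size-1 segment; whenever
// mergeFactor segments of the same size exist, they merge into one, and
// every doc in them is copied once. Counts total doc copies.
public class MergeCost {
    public static long copiesFor(int docs, int mergeFactor) {
        List<Integer> segments = new ArrayList<>(); // on-disk segment sizes
        long copies = 0;
        for (int d = 0; d < docs; d++) {
            segments.add(1);
            boolean merged = true;
            while (merged) {                        // cascade merges upward
                merged = false;
                Map<Integer, Integer> bySize = new HashMap<>();
                for (int s : segments) bySize.merge(s, 1, Integer::sum);
                for (Map.Entry<Integer, Integer> e : bySize.entrySet()) {
                    if (e.getValue() >= mergeFactor) {
                        int size = e.getKey();
                        for (int i = 0; i < mergeFactor; i++)
                            segments.remove((Integer) size);
                        segments.add(size * mergeFactor);
                        copies += (long) size * mergeFactor; // docs rewritten
                        merged = true;
                        break;
                    }
                }
            }
        }
        return copies;
    }

    public static void main(String[] args) {
        // 10,000 docs at mergeFactor 10: each doc is copied log10(10000) = 4
        // times on average, so 40,000 copies total - N log N overall work.
        System.out.println(copiesFor(10_000, 10));
    }
}
```

This is consistent with both observations in the thread: per-add cost looks ~O(1) in experiments because merges are rare, while the total merge work grows like N·log_m(N), so the biggest merges late in a large batch are what "starts to slow down".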
Re: BooleanQuery question
No, you don't need required or prohibited, but you can't have both. Here is a rundown:

* A required clause will allow a document to be selected if and only if it contains that clause and will exclude any documents that don't.
* A prohibited clause will exclude any documents that contain that clause.
* A clause that is neither prohibited nor required will select a document if it contains the clause, but the clause will not prevent non-matching documents from being selected by other clauses.

Hopefully that helps, Scott

On Jan 16, 2004, at 7:32 AM, Thomas Scheffler wrote: Karl Koch wrote: Hi all, why does the boolean query have a required and a prohibited field (boolean value)? If something is required it cannot be forbidden and otherwise? How does this match with the Boolean model we know from theory? What if required and prohibited are both off? That's something we need. Are there differences between Lucene and the Boolean model in theory? To save three conditions you have to take at least 2 bits. That's for the theory. Kind regards Thomas
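The three clause types in the rundown above can be modelled over documents represented as term sets. This is a toy illustration of the semantics, not Lucene's actual scorer:

```java
import java.util.*;

// Toy model: a doc matches if all required terms are present, no prohibited
// term is present, and (when nothing is required) at least one optional
// clause matches.
public class ClauseDemo {
    public static boolean matches(Set<String> doc,
                                  List<String> required,
                                  List<String> prohibited,
                                  List<String> optional) {
        for (String t : required) if (!doc.contains(t)) return false;
        for (String t : prohibited) if (doc.contains(t)) return false;
        if (required.isEmpty()) {
            // purely optional query: some clause must select the doc
            for (String t : optional) if (doc.contains(t)) return true;
            return false;
        }
        return true; // optional clauses only affect scoring, not selection
    }
}
```

Note how the "both off" case Karl asks about behaves: such clauses select documents on their own but never veto documents matched by other clauses.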
Re: java.io.IOException: Bad file number
I don't think adding extensive locking is necessary. What you are probably experiencing is that you've closed the index before you're done using it. If you aren't careful to close the index only after all searches on it have been completed, you'll get an error like this. Scott [EMAIL PROTECTED] wrote: Hello, I'm trying to debug a problem with a lucene installation which is getting java.io.IOException: Bad file number occasionally when performing searches. More specifically, the exception is coming when we are using a Reader to extract the hit Documents from the index (due to using getMessage instead of printStackTrace, I can't tell for sure if the exception is coming from opening the reader, or getting the document...arrgh!). I believe this problem is because of our design, which is that we allow ongoing multiple searches, and every 30 seconds we have a separate program which performs updates (adding and deleting documents) on the same index that is being searched. After a batch of updates are performed we close and re-open the IndexSearcher, the idea being that it should now be able to access the new documents. Is this a situation where we should have some locking in place that has searches wait while documents are being added/deleted? This would be easy enough to implement, but there is a lot of updating to do, and we don't want to sacrifice the excellent performance of the search by waiting every 30 seconds while updates happen. We've thought of two basic paths to take: 1. Implement a locking mechanism, and maybe try to add/delete one document each time the updating program aquires the lock, instead of a bigger batch. We think this might keep the search waiting the least amount of time, but updates will take longer. 2. Use a scheme with 2 indexes where we always update the one that isn't being searched in, and switch between the two. We are not sure if it makes sense to perform the switch every 30 seconds in this case. 
Does anyone have an idea if I am correct about the cause of the Exception, or any thoughts on the two possible solutions? We are running jdk 1.4, and lucene 1.2 on solaris. Thanks for any help you can give. Brad Hendricks -- Always drink upstream from the herd. - Will Rogers
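One common way out of the "closed while still searching" race described above is to reference-count the searcher, so that close is deferred until the last in-flight search releases it. The SearcherHandle wrapper below is hypothetical (it is not part of Lucene's API); in real code doClose() would call IndexSearcher.close():

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical refcounted wrapper: the owner holds one reference; each
// search acquires/releases around its work; the last release closes.
public class SearcherHandle {
    private final AtomicInteger refs = new AtomicInteger(1); // owner's ref
    private volatile boolean closed = false;

    public boolean acquire() {
        while (true) {
            int n = refs.get();
            if (n == 0) return false;              // already fully released
            if (refs.compareAndSet(n, n + 1)) return true;
        }
    }

    public void release() {
        if (refs.decrementAndGet() == 0) doClose();
    }

    private void doClose() {
        closed = true; // real code: IndexSearcher.close() goes here
    }

    public boolean isClosed() { return closed; }
}
```

With this, the 30-second updater simply publishes a new handle (e.g. via a volatile field) and calls release() on the old one; searches that acquired the old handle finish safely, and late arrivals that see acquire() fail retry against the new handle. No search ever waits on the updater, which avoids the locking cost option 1 worries about.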
Re: Multiple writers
Offhand, I would say that using 2 directories and merging them is exactly what you want. It really shouldn't be all that complicated and Lucene should handle the synchronization for you... Scott Dror Matalon wrote: Hi folks, We're in the process of adding search to our online RSS aggregator. You can see it in action at www.fastbuzz.com. Currently we have more than five million items in the system and it's growing at the rate of more than 100,000 a day. So we need to take into account that the index is constantly growing. One of the things we want to build into the system is the ability to rebuild the index on the fly while still inserting the items that are coming in. We've looked at having things go into different directories and then merge them, but it seems complicated and we'd need to worry about race conditions and locking issues. Anyone's done this before? Any suggestions? Regards, Dror -- ...there is nothing more difficult to execute, nor more dubious of success, nor more dangerous to administer than to introduce a new order to things; for he who introduces it has all those who profit from the old order as his enemies; and he has only lukewarm allies in all those who might profit from the new. - Niccolo Machiavelli
Re: Limit on number of required/prohibited clauses
Hi Eugene, Yes. Doug (Cutting) added this to eliminate OutOfMemory errors that apparently some people were having. Unfortunately, it causes backward-compatibility issues if you were used to using version 1.2. So, you'll need to add a call like this: BooleanQuery.setMaxClauseCount(Integer.MAX_VALUE); (Of course, you can set the parameter to whatever you want, but unrestricted works best for me.) Scott Eugene S. wrote: Hi, I've come across the limit on the number of required/prohibited clauses in a boolean query (the limit is 32). What is the reasoning for having such a limit? Can it be circumvented? Thanks! Eugene. -- All progress is initiated by challenging current conceptions, and executed by supplanting existing institutions. - George Bernard Shaw
Re: Reuse IndexSearcher?
Yes. You can (and should for best performance) reuse an IndexSearcher as long as you don't need access to changes made to the index. An open IndexSearcher won't pick up changes to the index, so if you need to see the changes, you will need to open a new searcher at that point. Scott Aviran Mordo wrote: Can I reuse one instance of IndexSearcher to do multiple searches (in multiple threads) or do I have to instantiate a new IndexSearcher for each search?
Re: Make Lucene Index distributable
Be careful with option 1. NFS and the Lucene file-based locking mechanism don't get along extremely well. (See the archives for details...) Scott Lienhard, Andrew wrote: I can think of three options: 1) Single index dir on a shared drive (NFS, etc.) which is mounted on each app server. 2) Create copies of the index dir for each machine. Requires regular updates, etc (not good if search data changes often). 3) Create a web service for search. Each app server makes an HTTP call to a standalone Lucene app which returns some sort of XML-formatted search result. I've taken approaches 1 and 3 (w/ Verity, but it would likely be the same w/ Lucene). 2 is really only good if you have relatively static data. For our Lucene rollout here, we're going w/ option 1. Andrew Lienhard Web Technology Manager United Media 200 Madison Avenue New York, NY 10016 http://www.dilbert.com http://www.snoopy.com http://members.comics.com -Original Message- From: Uhl V., DP ITS, SCB, FD [mailto:[EMAIL PROTECTED] Sent: Monday, August 18, 2003 11:05 AM To: '[EMAIL PROTECTED]' Subject: Make Lucene Index distributable Hello all, We have developed our WebApp with Lucene under Tomcat 4.X and stored the index in the file system. Now this web application has to move to a BEA WebLogic cluster. My problem is how to create a distributable Lucene index. Does anyone have ideas or experience with how to do this? (How should the index be stored?) Thanks for any ideas. Best regards, Vitali Uhl Client Server Systeme Deutsche Post ITSolutions GmbH tel. +49 (0) 661 / 921 -245 fax: +49 (0) 661 / 921 -111 internet: http://www.dp-itsolutions.de/ Address: DP ITSolutions GmbH D - 36035 Fulda -- All progress is initiated by challenging current conceptions, and executed by supplanting existing institutions.
- George Bernard Shaw
Re: NLucene up to date ?
Do these implementations maintain file compatibility with the Java version? Scott Erik Hatcher wrote: I'd love to see there be quality implementations of the Lucene API in other languages, that are up to date with the latest Java codebase. I'm embarking on a Ruby port, which I'm hosting at rubyforge.org. There is a Python version called Lupy. A related question I have is what about performance comparisons between the different language implementations? Will Java be the fastest? Is there a test suite already available that can demonstrate the performance characteristics of a particular implementation? I'd love to see the numbers and see if even the Java version can be beat. Erik On Thursday, July 31, 2003, at 08:43 AM, [EMAIL PROTECTED] wrote: Hi all, http://sourceforge.net/projects/nlucene/ has a version numbered 1.2b2. Does anyone know if this source is still being maintained to be closer to the java developments? Was this an external project to Apache Jakarta? I (we) have just successfully released a search engine using a C# implementation of Lucene. Code had to be brought up to date in line with recent java builds, and enhanced with additional features (e.g. field sorting, term position score factoring, etc). Any other C# users who would like to see NLucene kept in line with the java version? Maybe I'm just being lazy with having to maintain my own version of Lucene =). Surely there are others out there who are C# users and follow the mailing lists (I remember a Brian somewhere!) but seldom post. Brendon
Re: Luke - Lucene Index Browser
Nifty cool! I'm gonna like this, I can tell already! I'm having a really hard time actually using Luke, though, as all the window panes and table columns are apparently of fixed size. Do you think you could throw in the ability to resize the various window panes and table columns? This would make the tool truly useful. Pretty please? :) Thanks, Scott Andrzej Bialecki wrote: Dear Lucene Users, Luke is a diagnostic tool for Lucene (http://jakarta.apache.org/lucene) indexes. It enables you to browse documents in existing indexes, perform queries, navigate through terms, optimize indexes and more. Please go to http://www.getopt.org/luke and give it a try. A Java WebStart version will be available soon.
Re: Incremental indexing
+1. Support for transactions in Lucene is high on my list of desirable features as well. I would love to have time to look into adding this, but lately... well, you know how that goes. Scott Eric Jain wrote: If you want to update a set of documents, you can remove their previous version first and then add them after that. In the meantime documents of this set are temporarily not available. If you have to update a single document and make the changes immediately public, I don't know a better solution than yours. Thanks. I'm not so much worried about temporary inconsistencies as the index is maintained separately. Of course it would be great if Lucene provided direct support for some kind of transactional integrity! Anyways, removing all changed documents first means I have to scan through all documents twice, not very efficient, though in fact faster than the procedure I described. -- Eric Jain -- Brain: Pinky, are you pondering what I'm pondering? Pinky: I think so, Brain, but calling it a pu-pu platter? Huh, what were they thinking?
Re: How does delete work?
It just marks the record as deleted. The record isn't actually removed until the index is optimized. Scott Rob Outar wrote: Hello all, I used the delete(Term) method, then I looked at the index files, only one file changed _1tx.del I found references to the file still in some of the index files, so my question is how does Lucene handle deletes? Thanks, Rob -- Brain: Pinky, are you pondering what I'm pondering? Pinky: I think so, Brain, but calling it a pu-pu platter? Huh, what were they thinking?
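The behaviour Scott describes (a .del file is just a set of tombstone bits; the data stays in the immutable segment until an optimize rewrites it) can be illustrated with a toy segment. This is a sketch of the idea, not Lucene's storage format:

```java
import java.util.*;

// Toy segment: deletes only set a bit; readers skip marked docs; the data
// physically remains until optimize() rewrites the segment without them.
public class Segment {
    private final List<String> docs = new ArrayList<>();
    private final BitSet deleted = new BitSet();   // stands in for the .del file

    public int add(String doc) { docs.add(doc); return docs.size() - 1; }

    public void delete(int id) { deleted.set(id); } // just a mark, data remains

    public List<String> search() {                  // readers skip marked docs
        List<String> live = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++)
            if (!deleted.get(i)) live.add(docs.get(i));
        return live;
    }

    public int storedCount() { return docs.size(); } // tombstoned data still "on disk"

    public void optimize() {                         // rewrite without deletions
        List<String> live = search();
        docs.clear();
        docs.addAll(live);
        deleted.clear();
    }
}
```

This also explains Rob's observation: the other index files still reference the deleted document because nothing but the .del bitmap changed; only the optimize-time rewrite drops those references.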
Re: Fun project?
I'm rather partial to Jini for distributed systems, but I agree that JXTA would definitely be the way to go on this type of peer-to-peer scenario. Scott [EMAIL PROTECTED] wrote: I'll be doing something very similar some time in the next 12 months for the project I'm working on. I'll be more than happy to contribute the code when it's done, but the rest of the project has been implemented with CORBA, and it had been my plan to use CORBA for the distributed index servers as well. I'll look into JXTA though, as I hadn't come across it before. Kiril Otis Gospodnetic 21/11/2002 16:57 Please respond to Lucene Users List To: Lucene Users List cc: Subject: Re: Fun project? Yeah, I thought of that, too. JXTA is the P2P piece that you are asking about. A recent post on Slashdot mentioned something that IBM did that sounds similar. Time... :) Otis --- Robert A. Decker wrote: I wish I had time to work on this for fun, but I was thinking about what could be a fun lucene project... One could build a peer-to-peer document search application. Each client would index the documents on its harddrive, or documents in a particular directory. When the user at the computer does a search it will look at the documents on its harddrive, but also send out a request for the search on the P2P network. First though, are there any P2P java frameworks out there? One could build one, perhaps with OpenJMS, but it would be nice if one already existed. Hmm... if anyone else thinks this would be cool I'd be willing to work on this with you. thanks, Robert A. Decker http://www.robdecker.com/ http://www.planetside.com/ -- Brain: Pinky, are you pondering what I'm pondering?
Re: Searching Ranges
Hi Alex, I just looked at this and had the following thought: the RangeQuery must continue to iterate after the first match is found in order to match everything within the specified range. In other words, if you have a range of a to d, you can't stop with a; you need to continue to d. The point where you move beyond d is the point where the query should stop iterating. That is reflected in lines 160-162. It seems to me that your solution would only work where your range consists of a single term. Please let me know if I'm just misunderstanding the situation. Scott

Alex Winston wrote: thanks for the reply, my apologies for not explaining myself very clearly, it has been a long day. you expressed exactly our situation, unfortunately this is not an option because we want to have multiple ranges for each document as well. there is a possible extension of what you suggested but that is a last resort. kinda crazy i know, but you have to meet requirements :). but i also had a thought while i was looking through the lucene code, and any comments are welcome. i may be very mistaken because it has been a long day, but if you look at the current cvs version of RangeQuery it appears that even if a match is found it will continue to iterate over terms within a field, and in my case it is on the order of thousands. if i add a break after a match has been found it appears as though the search is improved on average by an order of magnitude; my math has left me so i cannot be theoretical at the moment. i have unit tested the change on my side and on the lucene side and it works. note: one hard example is that a query went from 20 seconds to .5 seconds. any initial thoughts as to whether there is a case where this would not work?

beginning line 164:
TermQuery tq = new TermQuery(term);  // found a match
tq.setBoost(boost);                  // set the boost
q.add(tq, false, false);            // add to q
break;                              // ADDED!

On Fri, 2002-11-08 at 15:09, Mike Barry wrote: Alex, It is rather confusing.
It sounds like you've indexed a field that can be between two values (let's say E-J), and then when you have a search term such as G you want the docs containing E-J (or A-H or F-K, but not A-C nor J-Z). Just off the top of my head, but could you index the upper and lower bounds as separate fields? Then when you search, do a compound query: lower_bound:{ - search_term } AND upper_bound:{ search_term - }. Just a thought. -MikeB.

Alex Winston wrote: i was hoping that someone could briefly review my current solution to a problem that we have encountered to see if anyone could suggest a possible alternative, because as it stands we have pushed lucene past its current limits. PROBLEM: we were wanting to represent a range of values for a particular field that is searchable over a particular range. an example follows for clarification: we were wanting to store a range of chapters and verses of a book for a particular document, and in turn search to see if a query range includes the range that is represented in the index. if this is unclear please ask for clarification. IMPRACTICAL SOLUTION: although this solution seems somewhat impractical it is all we could come up with. our solution involved storing each possible range value within the term, which would allow for RangeQuerys to be performed on this particular field. for very small ranges this seems somewhat practical after profiling, although once the field ranges began to span multiple chapters and verses, the search times became unreasonable because we were storing thousands of entries for each representative range. i can elaborate on anything that is unclear, but any thoughts on a possible alternative solution within lucene that we overlooked would be extremely helpful. alex
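The logic behind Mike's two-field suggestion can be sketched in plain Java: instead of enumerating every value inside a stored range, index only its endpoints and test containment with two comparisons. The field names and the zero-padded encoding here are illustrative assumptions, not something from the original thread.

```java
// Sketch of the two-field bound trick: a stored range [lowerBound,
// upperBound] contains a search term iff lowerBound <= term <= upperBound.
// Values must be encoded so that string order equals numeric order,
// e.g. zero-padded "003012" for chapter 3, verse 12 (an assumed encoding).
public class RangeContains {
    static boolean contains(String lowerBound, String upperBound, String term) {
        return lowerBound.compareTo(term) <= 0
            && upperBound.compareTo(term) >= 0;
    }
}
```

In Lucene terms this is what the compound query lower_bound:{ - search_term } AND upper_bound:{ search_term - } computes, with one comparison per bound instead of thousands of enumerated terms per range.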
Re: Your experiences with Lucene
Actually, 10k isn't very large. We have indexes with more than 1M records. It hasn't been a problem. Scott

Tim Jones wrote: Hi, I am currently starting work on a project that requires indexing and searching on potentially thousands, maybe tens of thousands, of text documents. I'm hoping that someone has a great success story about using Lucene for a project that required indexing and searching of a large number of documents. Like maybe more than 10,000. I guess what I'm trying to figure out is if Lucene's performance will be acceptable where the number of documents is very large. I realize this is a very general question but I just need a general answer. Thanks, Tim J.
RE: Using Filters in Lucene
Cool. But instead of adding a new class, why not change Hits to inherit from Filter and add the bits() method to it? Then one could pipe the output of one Query into another search without modifying the Queries... Scott

-Original Message- From: Doug Cutting [mailto:[EMAIL PROTECTED]] Sent: Monday, July 29, 2002 12:03 PM To: Lucene Users List Subject: Re: Using Filters in Lucene

Peter Carlson wrote: Would you suggest that search in selection type functionality use filters or redo the search with an AND clause?

I'm not sure I fully understand the question. If you have a condition that is likely to re-occur commonly in subsequent queries, then using a Filter which caches its bit vector is much faster than using an AND clause. However, you probably cannot afford to keep a large number of such filters around, as the cached bit vectors use a fair amount of memory--one bit per document in the index. Perhaps the ultimate filter is something like the attached class, QueryFilter. This caches the results of an arbitrary query in a bit vector. The filter can then be reused with multiple queries, and (so long as the index isn't altered) that part of the query computation will be cached. For example, RangeQuery could be used with this, instead of using DateFilter, which does not cache (yet). Caution: I have not yet tested this code. If someone does try it, please send a message to the list telling how it goes. If this is useful, I can document it better and add it to Lucene. Doug
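The QueryFilter idea Doug describes -- run a query once, cache its matches as a bit vector, and reuse the bits as long as the index is unchanged -- can be sketched without Lucene's classes. Here the "query" is just a predicate over document numbers, and the index-unchanged check is reduced to comparing maxDoc; both are simplifying assumptions.

```java
import java.util.BitSet;
import java.util.function.IntPredicate;

// Sketch of a caching query filter: the first call evaluates the query
// over every document and stores the result as one bit per document;
// later calls reuse the cached bits as long as the index size (our
// stand-in for "the index isn't altered") is the same.
public class CachedQueryFilter {
    private final IntPredicate query;
    private BitSet cached;          // one bit per document
    private int cachedMaxDoc = -1;

    CachedQueryFilter(IntPredicate query) { this.query = query; }

    BitSet bits(int maxDoc) {
        if (cached == null || cachedMaxDoc != maxDoc) {
            cached = new BitSet(maxDoc);
            for (int doc = 0; doc < maxDoc; doc++)
                if (query.test(doc)) cached.set(doc);
            cachedMaxDoc = maxDoc;
        }
        return cached;               // cached: no re-evaluation needed
    }
}
```

This is also where Doug's memory caveat comes from: each cached filter costs one bit per document in the index, so a large number of live filters adds up quickly.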
RE: Too many open files?
Are you closing the searcher after each request when done? No: waiting for the garbage collector is not a good idea. Yes: it could be a timeout on the OS holding the file handles. Either way, the only real option is to avoid thrashing the searchers... Scott

-Original Message- From: Hang Li [mailto:[EMAIL PROTECTED]] Sent: Tuesday, July 23, 2002 10:10 AM To: Lucene Users List Subject: Re: Too many open files? Thanks for your quick response, I still want to know why we ran out of file descriptors. --Yup. Cache and reuse your Searcher as much as possible. --Scott

-Original Message- From: Hang Li [mailto:[EMAIL PROTECTED]] Sent: Tuesday, July 23, 2002 9:59 AM To: Lucene Users List Subject: Too many open files? I have seen a lot of postings about this topic. Any final thoughts? We did a simple stress test; Lucene would produce this error between 30-80 concurrent searches. The index directory has 24 files (15 fields), and ulimit -n 32768, so there should be more than enough FDs. Note, we did not do any writing to the index while we were searching. Any ideas? Thx.
Forked files? was: RE: Too many open files?
Another idea to address this (quite common) problem: does anyone know of any Java file implementations that support a forked file or a file with multiple streams? Or, if not, do you know of any design patterns or documents explaining the theory and design behind this kind of thing? It would seem that if there were an efficient implementation of a forked file, perhaps that could be used instead of the set of files that Lucene currently uses to represent a segment. Scott

-Original Message- From: Scott Ganyo [mailto:[EMAIL PROTECTED]] Sent: Tuesday, July 23, 2002 10:13 AM To: 'Lucene Users List' Subject: RE: Too many open files?
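The "avoid thrashing the searchers" advice from this thread usually comes down to sharing one open searcher across requests instead of opening one per query. A minimal holder might look like the following; the Searcher interface and the version-based reopen check are stand-ins, since the real types are Lucene's.

```java
import java.util.function.LongFunction;

// Minimal shared-searcher holder: open once, reuse across requests,
// and only swap in a new searcher when the index actually changes.
// "Searcher" here is a stand-in interface, not Lucene's class.
public class SearcherCache {
    interface Searcher { void close(); }

    private Searcher current;
    private long currentVersion = -1;

    // "open" knows how to open a searcher for a given index version
    synchronized Searcher get(long indexVersion, LongFunction<Searcher> open) {
        if (current == null || indexVersion != currentVersion) {
            if (current != null) current.close();  // release file handles promptly
            current = open.apply(indexVersion);
            currentVersion = indexVersion;
        }
        return current;   // same instance for every request at this version
    }
}
```

With something like this, fifty concurrent requests share one set of open files instead of fifty, which is the difference between staying comfortably under ulimit and blowing through it.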
RE: CachedSearcher
I'd like to see the finalize() methods removed from Lucene entirely. In a system with heavy load and lots of gc, using finalize() causes problems. To wit: 1) I was at a talk at JavaOne last year where the gc performance experts from Sun (the engineers actually writing the HotSpot gc) were giving performance advice. They specifically stated that finalize() should be avoided if at all possible because the following steps have to happen for finalized objects: a) register the object when created b) notice the object when it becomes unreachable c) finalize the object d) notice the object when it becomes unreachable (again) e) reclaim the object. This leads to the following effects in the vm: a) allocation is slower b) heap is bigger c) gc pauses are longer. The Sun engineers recommended that if you really do need an automatic clean up process, weak references are *much* more efficient and should be used in preference to finalize(). 2) External resources (i.e. file handles) are not released until the reader is closed. And, as many have found, Lucene eats file handles for breakfast, lunch, and dinner. Scott

-Original Message- From: Halácsy Péter [mailto:[EMAIL PROTECTED]] Sent: Tuesday, July 16, 2002 12:43 AM To: Lucene Users List Subject: RE: CachedSearcher

-Original Message- From: Doug Cutting [mailto:[EMAIL PROTECTED]] Sent: Tuesday, July 16, 2002 1:00 AM To: Lucene Users List Subject: Re: CachedSearcher Why is this more complicated than the code in demo/Search.jhtml (included below)? FSDirectory closes files as they're GC'd, so you don't have to explicitly close the IndexReaders or Searchers.

I'll check this code, but I think it could hang up with a lot of opened IndexReaders. http://developer.java.sun.com/developer/TechTips/2000/tt0124.html (If a lot of searchers are requested and a writer is always modifying the index.) peter
RE: CachedSearcher
Point taken. Indeed, these were general recommendations that may or may not have a strong impact on Lucene's specific use of finalization. My only specific performance claim is that there will be a negative impact of some degree from using finalizers. Whether that impact is noticeable or not will probably depend upon a number of factors, so I will avoid making any further judgements on the impact of finalization on Lucene's performance until I have proof. Benchmarks aside, my point on the file handles is something that hit us square between the eyes. Before we started caching and explicitly closing our Searchers, we would regularly run out of file handles because of Lucene. This was despite increasing our allocated file handles to ludicrous levels in the OS. In general, Java developers would be well advised to explicitly release external resources when done with them rather than allowing finalization to take care of it. Scott

-Original Message- From: Doug Cutting [mailto:[EMAIL PROTECTED]] Sent: Tuesday, July 16, 2002 11:56 AM To: Lucene Users List Subject: Re: CachedSearcher

Scott Ganyo wrote: I'd like to see the finalize() methods removed from Lucene entirely. In a system with heavy load and lots of gc, using finalize() causes problems. [ ... ] External resources (i.e. file handles) are not released until the reader is closed. And, as many have found, Lucene eats file handles for breakfast, lunch, and dinner.

Lucene does open and close lots of files relative to many other applications, but the number of files opened is still many orders of magnitude less than the number of other objects allocated. I would be very surprised if finalizers for the hundreds of files that Lucene might open in a session would have any measurable impact on garbage collector performance given the millions of other objects that the garbage collector might process in that session. As usual, one should not make performance claims without performing benchmarks.
It would be a simple matter to comment out the finalize() methods, recompile, and compare indexing and search speed. If the improvement is significant, then we can consider removing the finalize methods. Doug
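Scott's closing recommendation -- release external resources explicitly rather than via finalize() -- is just the usual try/finally discipline, sketched here with a stand-in Resource type in place of an IndexReader or Searcher holding file handles.

```java
// Explicit release instead of relying on finalize(): the handle is
// closed deterministically when the work is done, not whenever the
// garbage collector gets around to it. "Resource" stands in for a
// reader/searcher holding OS file handles.
public class ExplicitClose {
    static class Resource {
        boolean open = true;
        void close() { open = false; }   // releases the "file handle"
    }

    static boolean useAndClose(Resource r) {
        try {
            return r.open;   // do the work while the resource is open
        } finally {
            r.close();       // always released, even if the work throws
        }
    }
}
```

The practical difference is the one the thread describes: under load, finalization can leave hundreds of unreclaimed handles pending gc, while explicit close bounds the count to what is actually in use.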
RE: IndexReader Pool
Deadlocks could be created if the order in which locks are obtained is not consistent. Note, though, that the locks are obtained in the same order each time throughout. (BTW: the inner lock is merely needed because the wait/notify calls need to own the monitor.) Naturally, you are free to make any suggestions for improvement! :) Scott

-Original Message- From: Ilya Khandamirov [mailto:[EMAIL PROTECTED]] Sent: Saturday, July 06, 2002 11:24 AM To: 'Lucene Users List' Subject: RE: IndexReader Pool

You are correct. Actually, there have been a few bug fixes since that was posted. Here's a diff to an updated version:

Well, I do not see your actual version of this file, but it looks like now you have two synchronized blocks: synchronized (sync) ... synchronized (info). This may produce deadlocks in a multithreading environment. Have you already solved this problem, or should I take a closer look at it?

Hope it helps, Sure. Thank you. Scott Regards, Ilya
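The rule Scott cites -- nested locks are safe as long as every thread acquires them in the same global order -- can be made concrete. The lock names below mirror the ones mentioned in the thread, but the class itself is an illustrative sketch, not the actual IndexReader pool code.

```java
// Consistent lock ordering: whenever both locks must be held, every
// code path acquires them in the same order (sync, then info). This
// rules out the classic deadlock where thread A holds sync and waits
// for info while thread B holds info and waits for sync.
public class LockOrdering {
    private final Object sync = new Object();   // outer lock
    private final Object info = new Object();   // inner lock
    private int value;

    void update(int v) {
        synchronized (sync) {        // 1st: outer
            synchronized (info) {    // 2nd: inner -- same order everywhere
                value = v;
            }
        }
    }

    int read() {
        synchronized (sync) {        // same order on the read path too
            synchronized (info) {
                return value;
            }
        }
    }
}
```

A deadlock would only become possible if some path acquired info first and then tried to take sync while holding it; as long as no such path exists, the two synchronized blocks Ilya spotted are harmless.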
RE: Stress Testing Lucene
Which came first--the out-of-file-handles error or the corruption? I haven't looked, but I would guess that if you ran into the file handles exception while writing, that might leave Lucene in a bad state. Lucene isn't transactional and doesn't really have the ACID properties of a database...

-Original Message- From: Nader S. Henein [mailto:[EMAIL PROTECTED]] Sent: Wednesday, June 26, 2002 11:45 PM To: Lucene Users List Subject: RE: Stress Testing Lucene I rebooted my machine and still the same issue. If I knew what caused that to happen, I would be able to solve it with some source tweaking, and it's not the file handles on the machine; I got over that problem months ago. Let's consider the worst case scenario: if corruption did occur, what could be the reasons? I'm going to need some insider help to get through this one. N.

-Original Message- From: Scott Ganyo [mailto:[EMAIL PROTECTED]] Sent: Wednesday, June 26, 2002 7:15 PM To: 'Lucene Users List' Subject: RE: Stress Testing Lucene 1) Are you sure that the index is corrupted? Maybe the file handles just haven't been released yet. Did you try to reboot and try again? 2) To avoid the too-many-files problem: a) increase the system file handle limits, b) make sure that you reuse IndexReaders as much as you can across requests and clients rather than opening and closing them.

-Original Message- From: Nader S. Henein [mailto:[EMAIL PROTECTED]] Sent: Wednesday, June 26, 2002 10:11 AM To: [EMAIL PROTECTED] Subject: Stress Testing Lucene Importance: High Hey people, I'm running a Lucene (v1.2) servlet on resin and I must say compared to Oracle Intermedia it's working beautifully. BUT today, I started stress testing and I downloaded a program called Web Roller, which simulates clients, requests, multi-threading... the works. I was doing something like 50 simultaneous requests and I was repeating that 10 times in a row.
But then something happened and the index got corrupted; every time I try opening the index with the reader to search, or open it with the writer to optimize, I get that damned too-many-files-open error. I can imagine that every application on the market has a breaking point and these breaking points have side effects. So is the corruption of the index a side effect, and if so, is there a way that I can configure my web server to crash before the corruption occurs? I'd rather re-start the web server and throw some people off whack than have to re-build the index or revert to an older version. Do you know of any way to safeguard against this? General Info: The index is about 45 MB with 60 000 XML files, each containing 18-25 fields. Nader S. Henein Bayt.com, Dubai Internet City Tel. +9714 3911900 Fax. +9714 3911915 GSM. +9715 05659557 www.bayt.com
RE: Stress Testing Lucene
1) Are you sure that the index is corrupted? Maybe the file handles just haven't been released yet. Did you try to reboot and try again? 2) To avoid the too-many-files problem: a) increase the system file handle limits, b) make sure that you reuse IndexReaders as much as you can across requests and clients rather than opening and closing them.

-Original Message- From: Nader S. Henein [mailto:[EMAIL PROTECTED]] Sent: Wednesday, June 26, 2002 10:11 AM To: [EMAIL PROTECTED] Subject: Stress Testing Lucene Importance: High Hey people, I'm running a Lucene (v1.2) servlet on resin and I must say compared to Oracle Intermedia it's working beautifully. BUT today, I started stress testing and I downloaded a program called Web Roller, which simulates clients, requests, multi-threading... the works. I was doing something like 50 simultaneous requests and I was repeating that 10 times in a row. But then something happened and the index got corrupted; every time I try opening the index with the reader to search, or open it with the writer to optimize, I get that damned too-many-files-open error. I can imagine that every application on the market has a breaking point and these breaking points have side effects. So is the corruption of the index a side effect, and if so, is there a way that I can configure my web server to crash before the corruption occurs? I'd rather re-start the web server and throw some people off whack than have to re-build the index or revert to an older version. Do you know of any way to safeguard against this? General Info: The index is about 45 MB with 60 000 XML files, each containing 18-25 fields. Nader S. Henein Bayt.com, Dubai Internet City Tel. +9714 3911900 Fax. +9714 3911915 GSM. +9715 05659557 www.bayt.com
RE: Boolean Query + Memory Monster
Use the java -Xmx option to increase your heap size. Scott

-Original Message- From: Nader S. Henein [mailto:[EMAIL PROTECTED]] Sent: Thursday, June 13, 2002 12:20 PM To: [EMAIL PROTECTED] Subject: Boolean Query + Memory Monster I have 1 GB of memory on the machine with the application. When I use a normal query it goes well, but when I use a range query it sucks the memory out of the machine and throws a servlet out-of-memory error. I have 80 000 records in the index and it's 43 MB large. Anything, people? Nader S. Henein Bayt.com, Dubai Internet City Tel. +9714 3911900 Fax. +9714 3911915 GSM. +9715 05659557 www.bayt.com
RE: Queryparser croaking on [ and ]
Actually, [] denotes an inclusive range of Terms. Anyway, why not change the syntax if this is bad...? Scott

-Original Message- From: Brian Goetz [mailto:[EMAIL PROTECTED]] Sent: Wednesday, February 20, 2002 10:08 AM To: Lucene Users List Subject: Re: Queryparser croaking on [ and ] This is because the query parser uses [] to denote ranges of numbers. (I always thought this was a bad choice of syntax for exactly this reason.)

On Wed, Feb 20, 2002 at 11:14:05AM -, Les Hughes wrote: Hi, I'm currently building a small app that allows searching of Java sourcecode. The problem I'm getting is when parsing a query string that contains an array specifier (i.e. String[] or int[][]): the query parser seems to croak with a Lexical error at line XX, column XX. Encountered: after : [] So what am I doing wrong / what should I write to fix this? Les
RE: JDK 1.1 vs 1.2+
+1

-Original Message- From: Matt Tucker [mailto:[EMAIL PROTECTED]] Sent: Tuesday, January 22, 2002 11:06 AM To: 'Lucene Users List' Subject: RE: JDK 1.1 vs 1.2+ Hey all, I'd just like to chime in support for dropping JDK 1.1, especially if it would aid i18n in Lucene. There just doesn't seem to be a compelling reason to build anything for JDK 1.1 anymore. Regards, Matt Jive Software

-Original Message- From: Andrew C. Oliver [mailto:[EMAIL PROTECTED]] Sent: Tuesday, January 22, 2002 10:52 AM To: Lucene Users List Subject: JDK 1.1 vs 1.2+ Hello everyone, I originally posted this question to the developers list, but was asked to repeat it here. I'm working on some new functionality I plan to submit for Lucene. In doing this I've noticed that Lucene currently maintains compatibility with JDK 1.1. This has some disadvantages, for instance the use of Vector versus some of the new collections. Next, some of the functionality I plan to add requires JDK 1.2. Finally, some of the internationalization features of Java do not work well in 1.1. For these reasons I suggest a move to 1.2+. While it seems reasonable to me to drop support for a 4-year-old version of the JDK, I realize it may still present a problem to some users and would like to raise a discussion on this. How many people are still using 1.1 and would be negatively affected by Lucene's use of 1.2 features? Of those, how many people can not move to 1.2 for server side development? -Andy -- www.superlinksoftware.com www.sourceforge.net/projects/poi - port of Excel format to java http://developer.java.sun.com/developer/bugParade/bugs/4487555.html - fix java generics! The avalanche has already started. It is too late for the pebbles to vote. -Ambassador Kosh
Re: Industry Use of Lucene?
We use Lucene extensively as a core part of our ASP product here at eTapestry. In fact, we've built our database query engine on top of it. We have been extremely pleased with the results. Scott

Jeff Kunkle asks: Does anyone know of any companies or agencies using Lucene for their products/projects? I am attempting to make a marketing pitch for Lucene to my manager and I know one of the first questions will be, Who else is using it? I know Lucene is a very powerful, fast, and flexible full-text search engine but my manager will need a little more coercing. Any help on this topic is greatly appreciated.
RE: Problems with prohibited BooleanQueries
I don't use a query parser at all, so that's no issue. I just need a BooleanQuery to realize that it only has negative clauses and do the right thing. Right now I have to include a bogus static field in every single document so that I can use a TermQuery on that bogus field as the left side of a BooleanQuery subtract. Sure, it works, but it ain't pretty... Scott

-Original Message- From: Doug Cutting [mailto:[EMAIL PROTECTED]] Sent: Thursday, November 01, 2001 10:49 AM To: 'Lucene Users List' Subject: RE: Problems with prohibited BooleanQueries

From: Scott Ganyo [mailto:[EMAIL PROTECTED]] How difficult would it be to get BooleanQuery to do a standalone NOT, do you suppose? That would be very useful in my case.

It would not be that difficult, but it would make queries slow. All documents not containing the term would need to be enumerated. Since most terms occur in only a small percentage of the documents, most NOT queries would return most documents. Scoring would also be strange. I guess you'd give them all a score of 1.0, and hope that the query is nested in a more complex query that will differentiate the scores. But if it's nested, then you could do it with BooleanQuery as it stands... So, my question to you is: do you actually want lists of all documents that do not contain a term, or, rather, do you want to use negation in the context of other query terms, and are having trouble getting your query parser to build BooleanQueries? Doug
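Scott's bogus-field workaround amounts to building a match-all document set and subtracting the negated term's documents. In set terms it looks like this (a sketch over bit vectors, not Lucene's query classes):

```java
import java.util.BitSet;

// A purely negative query as set subtraction: start from "everything"
// (the role played by the bogus static field present in every document)
// and remove the documents that contain the prohibited term.
public class NegativeOnlyQuery {
    static BitSet notQuery(int maxDoc, BitSet termDocs) {
        BitSet all = new BitSet(maxDoc);
        all.set(0, maxDoc);       // match-all stand-in for the bogus field
        all.andNot(termDocs);     // minus the prohibited term's docs
        return all;
    }
}
```

This also illustrates Doug's performance point: since most terms occur in few documents, the result of a standalone NOT is nearly the whole index, so the enumeration cost is proportional to maxDoc no matter how rare the term is.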
RE: File Handles issue
P.S. At one point I tried doing an in-memory index using the RAMDirectory and then merging it with an on-disk index and it didn't work. The RAMDirectory never flushed to disk, leaving me with an empty index. I think this is because of a bug in the mechanism that is supposed to copy the segments during the merge, but I didn't follow up on this.

That should work, it should be faster, and it would use a lot less memory than the approach you describe above. Can you please submit a simple test case illustrating the failure? Something self-contained would be best.

Ok. This will fail:

import java.io.*;
import org.apache.lucene.index.*;
import org.apache.lucene.analysis.*;
import org.apache.lucene.document.*;
import org.apache.lucene.store.*;

public class LuceneRAMDirectoryTest {
    public static void main(String[] args) {
        try {
            // create index in RAM
            RAMDirectory ramDirectory = new RAMDirectory();
            Analyzer analyzer = new SimpleAnalyzer();
            IndexWriter ramWriter = new IndexWriter(ramDirectory, analyzer, true);
            try {
                for (int i = 0; i < 100; i++) {
                    Document doc = new Document();
                    doc.add(Field.Keyword("field1", "" + i));
                    ramWriter.addDocument(doc);
                }
            } finally {
                ramWriter.close();
            }
            // then merge into file
            File file = new File("index");
            boolean missing = !file.exists();
            if (missing) file.mkdir();
            IndexWriter fileWriter = new IndexWriter(file, analyzer, true);
            try {
                fileWriter.addIndexes(new Directory[] { ramDirectory });
            } finally {
                fileWriter.close();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
RE: Trying To Understand Query Syntax Details
Not sure about the rest, but if you've stored your dates in YYYYMMDD format, you can use a RangeQuery like so: dateField:[20011001-null] This would return all dates on or after October 1, 2001. Scott

-Original Message- From: W. Eliot Kimber [mailto:[EMAIL PROTECTED]] Sent: Tuesday, October 16, 2001 11:10 AM To: lucene-user Subject: Trying To Understand Query Syntax Details I'm trying to understand the details of the query syntax. I found the syntax in QueryParser.jj, but it doesn't make everything clear. My initial questions: - It doesn't appear that ? can be the last character in a search. For example, to match fool and food, I tried foo?, but got a parse error. fo?l of course matches fool and foal. Is this a bug or an implementation constraint? - How does one specify a date range in a query? We need to be able to search on docs later than date x, and I know that Lucene supports date matching, but I don't see how to specify this in a query. Also, is there a description of the algorithm ~ uses? Thanks, E. -- . . . . . . . . . . . . . . . . . . . . . . . . W. Eliot Kimber | Lead Brain 1016 La Posada Dr. | Suite 240 | Austin TX 78752 T 512.656.4139 | F 512.419.1860 | [EMAIL PROTECTED] w w w . d a t a c h a n n e l . c o m
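The reason Scott's YYYYMMDD encoding works with a term RangeQuery is that zero-padded date strings compare lexicographically in the same order as the dates they represent, so ordinary string comparison over index terms does date comparison for free. A minimal check:

```java
// Zero-padded YYYYMMDD strings order lexicographically exactly as the
// dates order chronologically, which is what lets a term range like
// dateField:[20011001-null] select "on or after October 1, 2001".
public class DateEncoding {
    static boolean onOrAfter(String dateYyyymmdd, String cutoffYyyymmdd) {
        return dateYyyymmdd.compareTo(cutoffYyyymmdd) >= 0;
    }
}
```

The zero padding is essential: an unpadded encoding like "200191" vs "2001101" would sort in the wrong order, which is why variable-width date formats break range queries.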
RE: File Handles issue
Thanks for the detailed information, Doug! That helps a lot. Based on what you've said and on taking a closer look at the code, it looks like by setting mergeFactor and maxMergeDocs to Integer.MAX_VALUE, an entire index will be built in a single segment completely in memory (using the RAMDirectory) and then flushed to disk when closed. Given enough memory, it would seem that this would be the fastest setting (as well as using a minimum of file handles). Would you agree? Thanks, Scott P.S. At one point I tried doing an in-memory index using the RAMDirectory and then merging it with an on-disk index and it didn't work. The RAMDirectory never flushed to disk... leaving me with an empty index. I think this is because of a bug in the mechanism that is supposed to copy the segments during the merge, but I didn't follow up on this.
File Handles issue
We're having a heck of a time with too many file handles around here. When we create large indexes, we often get thousands of temporary files in a given index! Even worse, we just plain run out of file handles--even on boxes where we've upped the limits as much as we think we can! We've played around with various settings for the mergeFactor and maxMergeDocs, but these seem to have at best an indirect effect on the number of temporary files created. I'm not very familiar with the Lucene file system yet, so can someone briefly explain how Lucene works on creating an index? How does it determine when to create a new temporary file in the index and when does it decide to compress the index? Also, is there any way we could limit the number of file handles used by Lucene? This is becoming a huge problem for us, so any insight would be appreciated. Thanks, Scott