Bug 23650 (aka docs out of order)?
Re: http://issues.apache.org/bugzilla/show_bug.cgi?id=23650 Hello, I'm pretty confident that I'm misusing Lucene one way or another... and of course it was just a question of time before I ran into this "docs out of order" exception:

java.lang.IllegalStateException: docs out of order
 at org.apache.lucene.index.SegmentMerger.appendPostings(SegmentMerger.java:353)
 at org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java:316)
 at org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:290)
 at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:254)
 at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:93)
 at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:487)
 at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:366)
 at org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:389)

Still... the question is... which sort of misuse would trigger such an exception? For the record, this is using Lucene 1.4.3 on the following platform: JVM: Java HotSpot(TM) Client VM 1.5.0-beta; Language: English (United States); Encoding: Cp1252; Memory: 17 MB; Implementation: Sun Microsystems Inc.; OS: Windows XP 5.1; Architecture: X86. Any insight much appreciated :) Thanks! Cheers -- PA, Onnay Equitursay http://alt.textdrive.com/ - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: ngramj
On Feb 24, 2005, at 14:50, Gusenbauer Stefan wrote: Does anyone know a good tutorial or the javadoc for ngramj? I need it for guessing the language of the documents which should be indexed. http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/src/plugin/languageidentifier/ Cheers -- PA, Onnay Equitursay http://alt.textdrive.com/
Re: Opening up one large index takes 940M of memory?
On Jan 24, 2005, at 00:10, Vic wrote: (Is there a btree serialization impl in java?) http://jdbm.sourceforge.net/ Cheers -- PA http://alt.textdrive.com/
Re: Opening up one large index takes 940M of memory?
On Jan 22, 2005, at 23:50, Kevin A. Burton wrote: The problem I think for everyone right now is that 32bits just doesn't cut it in production systems... 2G of memory per process and you really start to feel it. Hmmm... no... no pain at all... or perhaps you are implying that your entire system is running on one puny JVM instance... in that case, this is perhaps more of a design problem than an implementation one... YMMV... Cheers -- PA http://alt.textdrive.com/
Re: Lucene appreciation
On Dec 16, 2004, at 17:26, Rony Kahan wrote: If you are interested in Lucene work you can set up an rss feed or email alert from here: http://www.indeed.com/search?q=lucene&sort=date Looks great :) One thing though, the web search returns 14 hits for the above query. Using the RSS feed only returns 4 of them. What gives? Cheers, PA.
Re: Opinions: Using Lucene as a thin database
On Dec 14, 2004, at 15:40, Kevin L. Cobb wrote: Was wondering if anyone out there was doing the same or if there are any dissenting opinions on using Lucene for this purpose. ZOE [1] [2] takes the same approach and uses Lucene as a relational engine of sorts. However, for both practical and ideological reasons, it does not store any raw data in the Lucene indices themselves but instead uses JDBM [3] for that purpose. All things considered, update issues aside, Lucene turns out to be a very flexible thin database. Cheers, PA. [1] http://zoe.nu/ [2] http://cvs.sourceforge.net/viewcvs.py/zoe/ZOE/Frameworks/SZObject/ [3] http://jdbm.sourceforge.net/
[RFE] IndexWriter.updateDocument()
Well, the subject says it all... If there is one thing which is overly cumbersome in Lucene, it's updating documents, therefore this Request For Enhancement: Please consider enhancing the IndexWriter API to include an updateDocument(...) method to take care of all the gory details involved in such an operation. Thanks in advance. Cheers, PA.
Re: GETVALUES +SEARCH
On Dec 01, 2004, at 13:37, Karthik N S wrote: We create an ArrayList Object and Load all the Hit Values into them and return the same for Display purpose on a Servlet. Talking of which... It would be very handy if org.apache.lucene.search.Hits would implement the java.util.List interface... in addition, org.apache.lucene.document.Document could implement java.util.Map... That way, the rest of the application could pretend to simply have to deal with a List of Maps, without having to get exposed to any Lucene internals... Thoughts? Cheers, PA.
Re: GETVALUES +SEARCH
On Dec 01, 2004, at 20:06, Erik Hatcher wrote: I also extensively use multiple fields of the same name. Odd... on the other hand... perhaps this is a matter of taste (une affaire de goût)... So does this rule out implementing the Map interface on Document? Why? Nobody mentioned what value such a Map would hold... in the worst case scenario it could hold a Collection... or perhaps it's not worth bothering with such esotericism and simply state that the DocumentMap only supports one value per key... after all... the purpose of providing standard interfaces such as List and Map is to simplify things... not to make them more cumbersome... PA.
Re: GETVALUES +SEARCH
On Dec 01, 2004, at 20:43, Erik Hatcher wrote: Sure, I could put it all together as a space separated String and use the WhitespaceAnalyzer, but why not do it this way? What other suggestions do you have for doing this? If this works for you, I don't see any problem with it. In general, I avoid storing any raw data in a Lucene Document. And only use Lucene for, er, indexing... but this is just me :) But let's go back to that fabled Map interface for Document... if the purpose of such an interface is to keep things simple, it could behave just like Document.get() [1]: "Returns the string value of the field with the given name if any exist in this document, or null. If multiple fields exist with this name, this method returns the first value added." If for some reason(s) you need multiple values per field, stick with getFields()... What's wrong with that? PA. [1] http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/document/Document.html#get(java.lang.String)
Re: GETVALUES +SEARCH
On Dec 01, 2004, at 21:14, Chris Hostetter wrote: The real question in my mind is not "how should we implement 'get' given that we allow multiple values?", a better question is "how should we implement 'put'?" Yes, retrofitting Document.add() in the Map interface would be a pain. But this is not really what I was getting at. This is more about Hits and accessing its values. One problem at a time :) If you think you know how to satisfy 90% of the users, I would still suggest that instead of making Document implement Map, instead add a toMap() function that returns a wrapper with the rules that you think make sense. (and leave the Document API uncluttered of the Map functions that people who don't care about Map don't need to see) Agreed. Document is fine as it is. It would be nice though to have a more or less standard interface to access the result set (e.g. Collection)... as consumers of Hits are more likely to be built in terms of the Collection API than anything specific to Lucene... PA.
[OT] Re: Lots Of Interest in Lucene Desktop
On Oct 28, 2004, at 20:26, Kevin A. Burton wrote: http://www.peerfear.org/rss/permalink/2004/10/28/LotsOfInterestInLuceneDesktop/ Many people, few ideas :) http://www.popsearch.net/index.html PA.
Re: Google Desktop Could be Better
On Oct 15, 2004, at 16:10, Tom Cunningham wrote: I'd be interested in trying to implement some of these ideas on Mac OS X, mostly because it's not already covered by Google Desktop, and I think the screensaver idea would work pretty well there. Anyone else want to give this a shot? Google invades (Windows) desktops: what's the Mac plan? http://www.bmannconsulting.com/node/1350 Google Desktop Search - It's About Time, But Not Complete http://bradnickel.com/?q=node/view/105 On the other hand, Apple is introducing Spotlight in their next Mac OS X iteration: http://www.apple.com/macosx/tiger/spotlight.html http://www.apple.com/macosx/tiger/spotlighttech.html While waiting for Godot, you may want to consider the existing Search Kit framework as an alternative to Lucene for Mac OS X specific tasks: http://developer.apple.com/documentation/UserExperience/Reference/SearchKit/ Cheers, PA. -- http://zoe.nu/
Re: Encrypted indexes
On Oct 13, 2004, at 15:26, Nader Henein wrote: Well, are you storing any data for retrieval from the index, because you could encrypt the actual data and then encrypt the search string public key style. Alternatively, write your index to an encrypted volume... something along the line of FileVault and PGP Disk [1] [2]. PA. [1] http://www.apple.com/macosx/features/filevault/ [2] http://www.pgp.com/products/desktop/index.html
Re: indexing size
Hi Niraj, On Sep 01, 2004, at 06:45, Niraj Alok wrote: If I make some of them Field.Unstored, I can see from the javadocs that it will be indexed and tokenized but not stored. If it is not stored, how can I use it while searching? The different types of fields don't impact how you do your search. This is always the same. Using Unstored fields simply means that you use Lucene as a pure index, for search purposes only, not for storing any data. Specifically, the assumption is that your original data lives somewhere else, outside of Lucene. If this assumption is true, then you can index everything as Unstored with the addition of one Keyword per document. The Keyword field holds some sort of unique identifier which allows you to retrieve the original data if necessary (e.g. a primary key, a URI, what not). Here is an example of this approach: (1) For indexing, check the indexValuesWithID() method http://cvs.sourceforge.net/viewcvs.py/zoe/ZOE/Frameworks/SZObject/SZIndex.java?view=markup Note the addition of a Field.Keyword for each document and the use of Field.UnStored for everything else (2) For fetching, check objectsWithSpecificationAndHitsInStore() http://cvs.sourceforge.net/viewcvs.py/zoe/ZOE/Frameworks/SZObject/SZFinder.java?view=markup HTH. Cheers, PA.
alternative query syntax?
Hello, I would like to provide an alternative query syntax for ranges by using a colon (':') or two dots ('..') instead of ' TO '. For example: mod_date:[20020101:20030101] Or mod_date:[20020101..20030101] What would be the correct procedure to modify the QueryParser to achieve this? Should I simply change QueryParser.jj's RANGEIN_TO and RANGEEX_TO to the appropriate character sequence and regenerate the corresponding Java classes with JavaCC? Any pointers appreciated as I'm not familiar with JavaCC :) TIA. Cheers, PA.
Re: indexing size
On Aug 31, 2004, at 17:17, Otis Gospodnetic wrote: You also have a large number of fields, and it looks like a lot (all?) of them are stored and indexed. That's what that large .fdt file indicated. That file is 206 MB in size. Try using Field.UnStored() to avoid storing all that data in your indices, as it's usually not necessary. PA.
Re: Lucene and MVC (was Re: Bad file descriptor (IOException) using SearchBean contribution)
On May 20, 2004, at 04:38, Erik Hatcher wrote: OffTopic: havoc and Struts go well together ;) Pick up Tapestry instead! Nah. Keep it really Simple [1] instead :o) [1] http://simpleweb.sourceforge.net/ PA.
index update (was Re: Large InputStream.BUFFER_SIZE causes OutOfMemoryError.. FYI)
On Apr 13, 2004, at 02:45, Kevin A. Burton wrote: He mentioned that I might be able to squeeze 5-10% out of index merges this way. Talking of which... what strategy(ies) do people use to minimize downtime when updating an index? My current strategy is as follows: (1) use a temporary RAMDirectory for ongoing updates. (2) perform a copy on write when flushing the RAMDirectory into the persistent index. The second step means that I create an offline copy of a live index before invoking addIndexes() and then substitute the old index with the new, updated, one. While this effectively increases the time it takes to update an index, it nonetheless reduces the *perceived* downtime for it. Thoughts? Alternative strategies? TIA. Cheers, PA.
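The swap in step (2) can be sketched with plain java.nio.file operations; this is a minimal illustration of the copy-on-write substitution only (the Lucene-specific RAMDirectory/addIndexes() parts are omitted, and all directory names are hypothetical):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;

public class IndexSwap {

    // Promote a freshly built index directory to the live location,
    // keeping the previous index around as a fallback.
    public static void promote(Path live, Path fresh, Path backup) {
        try {
            deleteRecursively(backup);            // drop any stale fallback
            if (Files.exists(live)) {
                Files.move(live, backup);         // retire the current index
            }
            Files.move(fresh, live);              // the new index goes live
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static void deleteRecursively(Path root) throws IOException {
        if (!Files.exists(root)) return;
        try (var paths = Files.walk(root)) {
            // delete children before parents
            paths.sorted(Comparator.reverseOrder())
                 .forEach(p -> p.toFile().delete());
        }
    }

    // Self-contained demonstration: build a "fresh" index next to a "live"
    // one in a temp directory, swap them, and report whether the swap
    // left the expected layout behind.
    public static boolean demo() {
        try {
            Path base = Files.createTempDirectory("swap-demo");
            Path live = base.resolve("live"), fresh = base.resolve("fresh"), backup = base.resolve("old");
            Files.createDirectories(live);
            Files.writeString(live.resolve("segments"), "old index");
            Files.createDirectories(fresh);
            Files.writeString(fresh.resolve("segments"), "new index");
            promote(live, fresh, backup);
            return "new index".equals(Files.readString(live.resolve("segments")))
                && "old index".equals(Files.readString(backup.resolve("segments")))
                && !Files.exists(fresh);
        } catch (IOException e) {
            return false;
        }
    }
}
```

The point of the two renames is that the perceived downtime shrinks to the moves themselves; searchers simply re-open on the live path afterwards.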
Re: Did you mean...
On Feb 12, 2004, at 16:42, Abhay Saswade wrote: How about creating a spellcheck dictionary with all words in the lucene index? That way you ensure that the word really exists in the index. You can indeed use the terms identified by Lucene as the dictionary words and apply traditional spell checking tricks like phonetic encodings, Levenshtein distance and so on. This approach works reasonably well in practice. Cheers, PA.
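For the edit-distance part, here is a self-contained sketch of the classic dynamic-programming Levenshtein distance (the class and method names are illustrative, not part of Lucene):

```java
public class EditDistance {

    // Classic dynamic-programming Levenshtein distance: the minimum number
    // of single-character insertions, deletions and substitutions needed
    // to turn string a into string b. Two rows suffice instead of the
    // full matrix.
    public static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;     // distance from "" to b[0..j)
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;                                       // distance from a[0..i) to ""
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1,   // insertion
                                            prev[j] + 1),      // deletion
                                   prev[j - 1] + cost);        // substitution (or match)
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }
}
```

With the index terms as the dictionary, any term within a distance of 1 or 2 of the query term is a reasonable suggestion candidate.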
Re: Index advice...
On Feb 10, 2004, at 14:03, Scott Ganyo wrote: I have. While document.add() itself doesn't increase over time, the merge does. Ways of partially overcoming this include increasing the mergeFactor (but this will increase the number of file handles used), or building blocks of the index in memory and then merging them to disk. This has been discussed before, so you should be able to find additional information on this fairly easily. This is what I noticed also: adding documents by itself is a fairly benign operation, but anything that triggers an index merge in one form or another is a killer as an index grows in size. So, overall, adding more documents does slow down the indexing. At least this is the impression I get. But I would love to be proven wrong on this :) Cheers, PA.
Re: index: how to store binary data or objects ?
On Feb 10, 2004, at 14:53, Markus Brosch wrote: My application will deal with small data sets. The problem is that I want to index the content (String) of some objects. I want to refer to that object once I find it by a keyword or whatever. So, using a simple map or tree? Something along these lines: - When indexing your object, you create one Lucene document for it and store its unique identifier as a keyword alongside whatever you want to index. - When retrieving your documents, you can use this keyword to reference your object. Another problem is that my objects can change their content and must be reindexed. Is it possible to remove the single index for that object and build a new one without reindexing all? Yes. Cheers, PA.
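The keyword-to-object indirection on the application side is just a plain map; a minimal sketch with illustrative names (the Lucene document would carry the returned id as its keyword field):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Keeps application objects addressable by the same unique identifier
// that would be stored as a Lucene keyword field. Names are illustrative.
public class ObjectRegistry<T> {
    private final Map<String, T> byId = new HashMap<>();

    // Register an object and return the identifier to store as a keyword.
    public String register(T object) {
        String id = UUID.randomUUID().toString();
        byId.put(id, object);
        return id;
    }

    // Replace an object under the same id after its content changed
    // (mirroring the delete/re-add of its Lucene document).
    public void update(String id, T object) { byId.put(id, object); }

    // Resolve the keyword found in a matching Lucene document back
    // to the original object.
    public T lookup(String id) { return byId.get(id); }
}
```

On the Lucene side, reindexing a single changed object then amounts to deleting the one document carrying that keyword and adding a fresh one, without touching the rest of the index.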
[OT] Re: Need Advices and Help
On Feb 05, 2004, at 13:01, Otis Gospodnetic wrote: I believe it would be the value of a 'Message-ID' or 'Reference' or 'Reference-ID' message header. However, I remember reading that mail readers are not very good at sticking to a standard (some RFC, I guess), so they don't always provide the correct ID, or they store it under non-standard names, etc. My suggestion: Look up Zoe (see Lucene Powered By page), download it, check its source and learn from it. http://zoe.nu/itstories/story.php?data=stories&num=24&sec=3 And be ready for a lot of pain and suffering ;) Trying to normalize email is not for the faint-hearted. Just my 2¢. Cheers, PA.
[OT] Digital Format-Specific Validation
http://hul.harvard.edu/jhove/ Might be of interest to some :) Cheers, PA.
moving documents from one index to another?
Hello, I'm trying to move a Document from one index to another, without necessarily reindexing it... The Document is composed of one Field.Keyword and a bunch of Field.UnStored. Reading such a Document from one index and then adding it to another one doesn't seem to have the expected effect though. Assuming that 'aReader' and 'aWriter' work on different indices: aDocument = aReader.document( index ); aWriter.addDocument( aDocument ); The Document added to the second index doesn't seem to preserve its information... What gives? Should I do that at a lower level? Does it make sense in the first place to try to move a raw Lucene Document between indices? TIA. Cheers, PA.
Re: moving documents from one index to another?
On Nov 20, 2003, at 13:45, Eric Jain wrote: If the document contains unstored fields, the only way to reconstruct the document is by iterating through all terms in the index and picking out those that reference the document. Hmmm... how would you do that? Something along the lines of aReader.terms() and then for each Term use aReader.termDocs() to try to figure out which document it belongs to? Something else altogether? How do you move the doc/terms to the other index then? This is likely to be too inefficient for any practical purposes... That's ok :) Alternatively, would it be possible to use FieldsReader/FieldsWriter or such to move the raw data from one index to the other without ill side effects? TIA. Cheers, PA.
Re: moving documents from one index to another?
On Nov 20, 2003, at 14:13, Eric Jain wrote: That's what I had in mind, but maybe there is a better way. Once all terms are collected, they can be reassembled into a new document that can then be indexed again. I see. Assuming I have the relevant terms for a given document, how would I build a new document based on those terms? Something like adding each term's field and text to the new document? What would a term's text hold for an unstored field? TIA. PA.
Re: moving documents from one index to another?
On Nov 20, 2003, at 14:34, Eric Jain wrote: I believe a term always contains its own text. (It must be somewhere, after all...) Documents on the other hand may or may not contain the original text, depending on whether a field is stored or not. This seems to be the case: the term's text holds the correct value. Thanks. Cheers, PA.
Re: moving documents from one index to another?
On Nov 20, 2003, at 14:34, Eric Jain wrote: I see. Assuming I have the relevant terms for a given document, how would I build a new document based on those terms? Something like adding each term's field and text to the new document? Yes. Ok. Retrieving the terms for a document turns out to be pretty straightforward, but building a new document turns out to be slightly more convoluted than expected... I basically need to know which kind of field to create (Stored, Indexed, Tokenized), but this information doesn't seem to be available in the document I'm trying to clone. I thought I could use the original Document's getField() method to retrieve this information, but aside from the Keyword field, none of the other fields are available... where can I get this info at this stage? Here is the problematic method for cloning a document:

private Document cloneDocumentWithTerms(final Document aDocument, final Collection someTerms) {
    if ( aDocument != null ) {
        if ( someTerms != null ) {
            Document anotherDocument = new Document();
            anotherDocument.setBoost( aDocument.getBoost() );
            for ( Iterator anIterator = someTerms.iterator(); anIterator.hasNext(); ) {
                Term aTerm = (Term) anIterator.next();
                String aKey = aTerm.field();
                String aValue = aTerm.text();
                Field aField = aDocument.getField( aKey );
                boolean isStored = aField.isStored();
                boolean isIndexed = aField.isIndexed();
                boolean isTokenized = aField.isTokenized();
                Field anotherField = new Field( aKey, aValue, isStored, isIndexed, isTokenized );
                anotherField.setBoost( aField.getBoost() );
                anotherDocument.add( anotherField );
            }
            return anotherDocument;
        }
        throw new IllegalArgumentException( "Index.cloneDocumentWithTerms: null terms." );
    }
    throw new IllegalArgumentException( "Index.cloneDocumentWithTerms: null document." );
}

The problem is that aDocument.getField( aKey ) returns null most of the time. What gives? TIA. PA.
Re: Document ID's and duplicates
On Nov 19, 2003, at 18:14, Don Kaiser wrote: If you do this will the old version of the document be replaced by the new one? No. They will coexist. In Lucene, an update implies a delete/insert sequence. PA.
Re: inter-term correlation [was Re: Vector Space Model in Lucene?]
On Nov 14, 2003, at 19:50, Chong, Herb wrote: if you are handling inter-term correlation properly, then terms can't cross sentence boundaries. Could you not break down your document along sentence boundaries? If you manage to figure out what a sentence is, that is. if you are not paying attention to sentence boundaries, then you are not following rules of linguistics. Rules of linguistics? Is there such a thing? :) PA.
Re: Vector Space Model in Lucene?
On Nov 14, 2003, at 20:27, Dror Matalon wrote: I might be the only person on the list who's having a hard time following this discussion. Nope. I don't understand a word of what those guys are talking about either :) Would one of you wise folks care to point me to a good dummies, also known as an executive summary, resource about the theoretical background of all of this. I understand the basic premise of collecting the words and having pointers to documents and weights, but beyond that ... That's good enough :) PA.
Re: inter-term correlation [was Re: Vector Space Model in Lucene?]
On Nov 14, 2003, at 20:29, Philippe Laflamme wrote: Rules of linguistics? Is there such a thing? :) Actually, yes there is. Natural Language Processing (NLP) is a very broad research subject but a lot has come out of it. A lot of what? If statements? :) More specifically, rule-based taggers have become very popular since Eric Brill published his work on trainable rule-based tagging. Essentially, it comes down to analysing sentences to determine the role (noun, verb, etc.) of each word. It's very helpful to extract noun phrases such as "cardiovascular disease" or "magnetic resonance imaging" from documents. I would agree with that. But it's easier said than done. And the results are never, er, clear cut. So, yep... you can definitely derive rules to analyse natural language... Well... beyond the jargon and the impressive math... this all boils down to fuzzy heuristics and judgment calls... but perhaps this is just me :) I'm sure you already know about all of this... Not really. I'm more of a dilettante than an NLP expert. just thought it might be interesting for some... Sure. But my take on this is that pigs will fly before NLP turns into a predictable science :) PA.
Re: Vector Space Model in Lucene?
On Nov 14, 2003, at 21:16, Chong, Herb wrote: if you know what TREC is, you know what i meant earlier. this isn't exotic technology, this is close to 15 year old technology. This is not really what I asked. What I would be interested to know is which approach you consider to provide the biggest bang for your buck? PA.
Re: inter-term correlation [was Re: Vector Space Model in Lucene?]
On Nov 14, 2003, at 21:14, Philippe Laflamme wrote: Rules of linguistics? Is there such a thing? :) Actually, yes there is. Natural Language Processing (NLP) is a very broad research subject but a lot has come out of it. A lot of what? If statements? :) Yes... just like every piece of software boils down to branching and while loops for the processor... ;o) Hehe... ;) But NLP seems to suffer more from heuristics disguised in fancy jargon than other fields... I would agree with that. But it's easier said than done. Yes, of course this is very complex. That's why NLP is a very popular field of research: it's challenging! Indeed. And the results are never, er, clear cut. You're correct, results are not 100% perfect. But getting 95% is pretty impressive when you're dealing with computer software. Don't forget, even with many years (decades even) of experience with our own language, we humans still manage to misunderstand certain sentences... can you really expect software to be 100% correct all the time? Nope. Hence my tongue-in-cheek comments... Sure. But my take on this is that pigs will fly before NLP turns into a predictable science :) Maybe you're right, technologies derived from NLP may never be perfect. But it doesn't make them useless. Quite the contrary I think. Perhaps. I'm not saying it's utterly useless as a whole. But... NLP has a noted tendency to over promise and under deliver. Plus, it's marred with too much jargon, which is suspicious in and of itself :) I'm not a Lucene expert, but I'm sure it could benefit from using derived NLP methods for text analysis. For hardcore text analysis, perhaps. But Lucene is a low-level indexing library. You can build something much more, er, esoteric on top of it. But I don't think that the core library would benefit from any bizarre additions. Plus, the core elements of the library already provide more than enough room to play with whatever scheme you may have in mind.
Maybe someone out there has some experience they might want to share with us? Perhaps. But one way or another, and as far as Lucene is concerned, you will be better off building something exotic on top of Lucene than messing around with its internals. PA.
Re: fuzzy searches
On Nov 11, 2003, at 21:02, Bruce Ritchie wrote: Just a note: LSI is encumbered by US patents 4,839,853 and 5,301,109. It would be wise to make sure that any implementation is either blessed by the patent holders or does not infringe on the patents. Since when did developers turn into armchair IP lawyers? Is it a national game? PA.
Re: Objection to using /tmp for lock files.
On Nov 13, 2003, at 19:00, Dror Matalon wrote: I've been experimenting with it and it seems to work as advertised. It has the advantage of not requiring *any* write capability in /tmp or anywhere else. There is a system property to turn off the lock files altogether. PA.
Re: Query Filters on term A in query A AND (B OR C OR D)
On Nov 13, 2003, at 22:32, Jie Yang wrote: I am trying to optimise the 500 OR terms so that it does not do a full 2 million docs search but only searches on the 1000 returned. Would it be beneficial to move the first result set into its own (transient) index to perform the second part of your query? PA.
Re: Overview to Lucene
Hi Ralf, On Nov 12, 2003, at 14:06, [EMAIL PROTECTED] wrote: Does anybody know good articles which demonstrate parts of that or give a good start into Lucene? Otis Gospodnetic's articles are a good starting point: Introduction to Text Indexing with Apache Jakarta Lucene http://www.onjava.com/pub/a/onjava/2003/01/15/lucene.html Advanced Text Indexing with Lucene http://www.onjava.com/pub/a/onjava/2003/03/05/lucene.html Cheers, PA.
Re: Document Clustering
On Nov 11, 2003, at 16:05, Marcel Stör wrote: As everybody seems to be so excited about it, would someone please be so kind as to explain what document based clustering is? This mostly means finding documents which are similar in some way(s). The similitude is mostly in the eyes of the beholder. In such a world, a cluster would be a pile of documents sharing something. As far as Lucene goes, a straightforward way of approaching this could be to use an entire document's content to query an index. Lucene's result set could then be construed as a document cluster. Admittedly, this is ground zero of document clustering, but here you go anyway :) Here is an illustration: Patterns in Unstructured Data: Discovery, Aggregation, and Visualization http://javelina.cet.middlebury.edu/lsa/out/cover_page.htm Cheers, PA.
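The "use a document's content as the query" idea boils down to comparing term-frequency vectors. A toy cosine-similarity sketch, with naive whitespace tokenizing and no stemming (all names illustrative, nothing Lucene-specific):

```java
import java.util.HashMap;
import java.util.Map;

public class DocSimilarity {

    // Bag-of-words term frequencies from a naive lowercase/whitespace split.
    static Map<String, Integer> termFreqs(String text) {
        Map<String, Integer> freqs = new HashMap<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) freqs.merge(token, 1, Integer::sum);
        }
        return freqs;
    }

    // Cosine similarity of two documents' term-frequency vectors:
    // 1.0 for identical word distributions, 0.0 for no shared terms.
    public static double cosine(String a, String b) {
        Map<String, Integer> fa = termFreqs(a), fb = termFreqs(b);
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Integer> e : fa.entrySet()) {
            dot += e.getValue() * fb.getOrDefault(e.getKey(), 0);
            na += e.getValue() * e.getValue();
        }
        for (int v : fb.values()) nb += v * v;
        return (na == 0 || nb == 0) ? 0.0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }
}
```

A greedy pass that assigns each document to the nearest existing pile above some similarity threshold, starting a new pile otherwise, gives exactly the "ground zero" clustering described above.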
Re: Document Clustering
On Nov 11, 2003, at 16:58, Tate Avery wrote: Categorization typically assigns documents to a node in a pre-defined taxonomy. For clustering, however, the categorization 'structure' is emergent... i.e. the clusters (which are analogous to taxonomy nodes) are created dynamically based on the content of the documents at hand. Another way to look at it is this: An attempt to apply the Dewey Decimal system to an orgy. [1] Without a Dewey Decimal system that is. Cheers, PA. [1] http://www.eod.com/devil/archive/semantic_web.html
Re: Document Clustering
On Nov 11, 2003, at 21:32, maurits van wijland wrote: There is the carrot project : http://www.cs.put.poznan.pl/dweiss/carrot/ Leo Galambos, author of the Egothor project, constantly supports us with fresh ideas and includes Carrot components in his own project! http://www.cs.put.poznan.pl/dweiss/carrot/xml/authors.xml?lang=en Small world :) PA.
Re: The best way forward
On Nov 04, 2003, at 13:04, Otis Gospodnetic wrote: Eventually i am going to try to implement something similar to google groups, indexing lots of NNTP traffic. Has anyone done this before with Lucene? Not that I know, but people have used Lucene to index their email, which is somewhat similar. Very similar indeed :) Perhaps you should take a look at ZOE: http://zoe.nu/ It uses Lucene quite extensively to index email-type things. NNTP support could be a stone's throw away, as you would only need to plug in the appropriate JavaMail Store to handle NNTP specifics. On the other hand, I doubt that anyone has tried to index anything on the scale of Google's data set... NNTP or not :) Cheers, PA.
Re: Relational Search
On Nov 04, 2003, at 19:28, Tate Avery wrote: Does anyone have any creative ideas for tackling this problem with Lucene? Perhaps. Not sure if this is quite what you are after, but you could take a look at ZOE's SZObject framework. It's built on top of Lucene to provide lightweight ODBMS-like functionality. Cheers, PA. -- http://zoe.nu/
Re: The best way forward
Hi Dror, On Nov 04, 2003, at 19:33, Dror Matalon wrote: By the way, we're also thinking of integrating newsgroups into our RSS aggregator, which you can see at www.fastbuzz.com. ZOE does something similar already. It can vend messages as RSS feeds: http://zoe.nu/itstories/story.php?data=stories&num=43&sec=2 And also aggregate RSS feeds: http://zoe.nu/itstories/story.php?data=stories&num=67&sec=2 Are you interested in comparing notes, or possibly pooling resources? Who? ZOE? Perhaps. You should drop by its mailing lists: https://lists.sourceforge.net/lists/listinfo/zoe-develop https://lists.sourceforge.net/lists/listinfo/zoe-general Archives available here: http://news.gmane.org/gmane.mail.zoe.devel/ http://news.gmane.org/gmane.mail.zoe.general/ We have plenty of technical resources, and we've run news servers before, although it's been a few years. Cheers, PA.
Re: Term out of order.
On Oct 30, 2003, at 13:36, Pasha Bizhan wrote: I think that it's a problem of the Java version of Lucene, because all core algorithms of Lucene and Lucene.Net are identical. Talking of which... it appears... that... something... is... wrong... somewhere... This definitely needs some additional investigation on my side, as I'm quite at a loss about this sudden exception and I cannot reproduce it myself... sigh... Trace: java.io.IOException: term out of order at org.apache.lucene.index.TermInfosWriter.add(TermInfosWriter.java:103) at org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java: 249) at org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java: 225) at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java: 188) at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:98) at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:425) at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:301) at org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:316)
Exotic format indexing?
Hello, Indexing a multitude of esoteric formats (MS Office, PDF, etc.) is a popular question on this list... The traditional approach seems to be to try to find some kind of format-specific reader to properly extract the textual part of such documents for indexing. The drawback of such an approach is that it's complicated and cumbersome: many different formats, and not that many Java libraries that understand them all. An alternative to such a mess could perhaps be to convert that multitude of formats into something more or less standard and then extract the text from that. But again, this doesn't seem to be such a straightforward proposition. For example, one could imagine printing every document to PDF and then converting the resulting PDF to text. Not a piece of cake in Java. Finally, a while back, somebody on this list mentioned quite a different approach: simply read the raw binary document and go fishing for what looks like text. I would like to try that :) Does anyone remember this proposal? Has anyone tried such an approach? Thanks for any pointers. Cheers, PA.
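The "go fishing for text" idea above can be sketched in a few lines of plain Java. This is a hypothetical illustration, not code from any library: scan the raw bytes and keep only runs of printable ASCII above a minimum length (the class name and the `MIN_RUN` threshold are made up for this sketch).

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of "fishing for text" in a raw binary document:
// keep only runs of printable ASCII longer than a small threshold.
public class TextFisher {

    private static final int MIN_RUN = 4; // shortest run worth keeping

    public static List<String> extract(byte[] raw) {
        List<String> runs = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (byte b : raw) {
            char c = (char) (b & 0xFF);
            boolean printable = (c >= 0x20 && c < 0x7F) || c == '\t';
            if (printable) {
                current.append(c);
            } else {
                if (current.length() >= MIN_RUN) {
                    runs.add(current.toString());
                }
                current.setLength(0); // reset on any non-printable byte
            }
        }
        if (current.length() >= MIN_RUN) {
            runs.add(current.toString());
        }
        return runs;
    }

    public static void main(String[] args) {
        byte[] blob = "\u0000\u0001Hello world\u0000ab\u0001indexable text\u0007"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(TextFisher.extract(blob));
        // Prints: [Hello world, indexable text]
    }
}
```

As Ben points out in the follow-up, this falls flat on compressed formats such as PDF, but for formats that store their text uncompressed it can be a good-enough first pass.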
Re: 182 file formats for lucene!!! was: Re: Exotic format indexing?
Hi Stefan, On Oct 30, 2003, at 21:02, Stefan Groschupf wrote: just to let you know, i had implemented for the nutch project a plugin that can parse 182 file formats, including m$ office. I simply use OpenOffice and the available Java API. Yes, I saw that. Great work :) Unfortunately, using OpenOffice is not an option in my case :( Cheers, PA.
Re: Exotic format indexing?
On Oct 30, 2003, at 20:48, Ben Litchfield wrote: Unfortunately, it is not quite so easy. I am not sure about Word documents The raw text is visible. but PDFs usually have their contents compressed Yep. PDF is really an image format ;) so raw fishing around for text would be pointless. That's alright. I can handle PDF separately if the need arises. Your best bet is to use a package like the one from textmining.org that handles various formats for you. Perhaps. But I'm only looking for a good-enough solution, not a perfect one :) Cheers, PA.
Re: java.nio.channels.FileLock
On Oct 29, 2003, at 19:08, Ronald Muller wrote: What is the advantage of using a FileLock object instead of the way Lucene does it? (I do not see it) Less code. Less worries. Also note an important limitation: File locks are held on behalf of the entire Java virtual machine. They are not suitable for controlling access to a file by multiple threads within the same virtual machine. Perhaps. Have you used it? Any practical experience with it? For or against? Cheers, PA.
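For reference, here is a minimal sketch of what the FileLock approach would look like. This is not how Lucene actually locks its indexes; the class and method names are made up for illustration. Note that `tryLock()` returns null only when another *process* holds the lock; within a single JVM an overlapping lock attempt throws `OverlappingFileLockException` instead, which is exactly the JVM-wide limitation quoted above.

```java
import java.io.File;
import java.io.RandomAccessFile;
import java.nio.channels.FileLock;

// Hypothetical sketch of guarding index writes with java.nio FileLock.
// The lock is held on behalf of the entire JVM, so it serializes
// processes, NOT threads within one virtual machine.
public class NioIndexLock {

    public static boolean withLock(File lockFile, Runnable critical) throws Exception {
        RandomAccessFile raf = new RandomAccessFile(lockFile, "rw");
        try {
            FileLock lock = raf.getChannel().tryLock();
            if (lock == null) {
                return false; // another process owns the index
            }
            try {
                critical.run(); // e.g. add documents, optimize, ...
            } finally {
                lock.release();
            }
            return true;
        } finally {
            raf.close();
        }
    }
}
```

The upside is that the OS releases the lock when the process dies, so there are no stale lock files to clean up; the downside is the intra-JVM caveat above.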
Re: Weird NPE in RAMInputStream when merging indices
Hi Otis, On Wednesday, Oct 22, 2003, at 18:06 Europe/Amsterdam, Otis Gospodnetic wrote: Since 'files' is a Hashtable, neither the key nor the value (file) can be null, even though the NPE in the RAMInputStream constructor implies that file was null. Yep... pretty weird... but looking at openFile(String name)... could it somehow be possible that the name is invalid for some reason and therefore doesn't exist in the Hashtable? Then files.get(name) would return null and new RAMInputStream(file) would raise an NPE. This would not explain why the name is invalid in the first place... but that could be a start for an investigation... what do you think? Cheers, PA.
Re: new release: 1.3 RC2
Hello, On Wednesday, Oct 22, 2003, at 18:13 Europe/Amsterdam, Doug Cutting wrote: A new Lucene release is available. Very nice. Thanks :) Quick question regarding release note number 11: what's the difference between IndexWriter.addIndexes(IndexReader[]) and IndexWriter.addIndexes(Directory[]), besides the fact that one takes an array of IndexReader and the other an array of Directory? Any functional differences? Is one way recommended over the other? Cheers, PA.
Weird NPE in RAMInputStream when merging indices
Hello, What could cause such a weird exception? RAMInputStream.<init>: java.lang.NullPointerException java.lang.NullPointerException at org.apache.lucene.store.RAMInputStream.<init>(RAMDirectory.java:217) at org.apache.lucene.store.RAMDirectory.openFile(RAMDirectory.java:182) at org.apache.lucene.index.FieldInfos.<init>(FieldInfos.java:78) at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:116) at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:378) at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:298) at org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:313) I don't know if this is a one-off, as I cannot reproduce this problem nor have I seen it before, but I thought I might as well ask. This is triggered by merging a RAMDirectory into an FSDirectory. Looking at the RAMDirectory source code, this exception seems to indicate that the file argument to the RAMInputStream constructor is null... how could that ever happen? Here is the code which triggers this weirdness: this.writer().addIndexes( new Directory[] { aRamDirectory } ); The RAM writer is checked before invoking this code to make sure there is some content in the RAM directory: aRamWriter.docCount() > 0 This has been working very reliably since the dawn of time, so I'm a little bit at a loss as to how to diagnose this weird exception... Any ideas? Thanks. Cheers, PA.
[OT] Open Source Goes to COMDEX
Hello, This is pretty much off-topic, but... ZOE has been nominated as one of the candidate projects to go to the Open Source Innovation Area on the COMDEX exhibit floor. http://www.oreillynet.com/contest/comdex/ ZOE is one of the few Java projects shortlisted, and it uses Lucene quite extensively. Show your support by voting for ZOE :) Cheers, PA. -- http://zoe.nu/
Index locked for write
[Posted to Dev by mistake] [Reposted to User] [Sorry for the mess] Hello, I recently updated from 1.3 RC1 to the latest CVS version. RC1 has proven very reliable for me, but I needed Dmitry's compound index functionality, therefore the move to the CVS version. I have been using 1.3 RC1 without any problem. But... since updating to the CVS version, I'm getting a lot of apparently random IOExceptions related to locking: java.io.IOException: Index locked for write: Lock@/tmp/lucene-5b228139f8fe55f7c74441a7d59f8f89-write.lock at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:173) at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:150) This is most likely due to some problem on my side, but for the life of me I cannot track it down nor reproduce it :( Also, the only Lucene-related change on my side was the update from 1.3 RC1 to the CVS version. Perhaps this has triggered a dormant bug in my app. Or perhaps something has changed in the CVS version which impacts me negatively. Either way, I'm at a loss. My guess would be that this is most likely a threading issue. On my side, I use a very conservative threading model which supposedly synchronizes all access to Lucene. And this hasn't changed for a good while. Any idea where I should look in such a situation? Any significant changes related to locking on the Lucene side? For the record, this problem seems to mostly manifest itself under Mac OS X, running Java 1.4.1_01. Thanks. Cheers, PA.
Re: which lock belong to which index?
Hi Otis, On Thursday, Oct 2, 2003, at 13:56 Europe/Amsterdam, Otis Gospodnetic wrote: I cannot remember the answer I got, but I asked the same question after the code was changed to put locks in java.io.tmpdir. Because I have an application that deals with a lot of indices simultaneously, I felt like this would make things more difficult in cases where you have stale locks, etc. Try the archive, though, as I seem to recall that somebody, Doug or Scott, gave me the answer. I see... I'm sure I could get to the lock name and scan the tmp directory for a match... but why such a complication in the first place? The only thing I can think of is applications running on read-only media... but in such a case there is no need for a lock in the first place... Cheers, PA. Very confused.
Re: Is the lucene index serializable?
Can I send a small Lucene index by SOAP/TCP/HTTP/RMI? Is there a way to serialize a Lucene index? I want to send it from the indexer server to the search server, and then do a merge operation in the search server with the previous index file. Well, what about a very old-fashioned way instead? Something like tar.gz.ftp? Not very glamorous, but workable... Cheers, PA.
Re: Design question
I, like a lot of other people, am new to Lucene. Practical examples are pretty scarce. If you don't mind learning by example, take a look at the Powered by Lucene page. A fair number of those projects are open source. http://jakarta.apache.org/lucene/docs/powered.html PA.
Re: Lucene app to index Java code
Hi Otis, On Thursday, Sep 4, 2003, Otis Gospodnetic wrote: Has anyone written an application that uses Lucene to index Java code, either from the source .java files or compiled .class files? If you are talking about my ultra-secret project Zapata: Coding Mexican Style, then yes ;) But... it uses runtime information to reach its devious ends and is more like a documentation tool than anything else... Anyway, this is how it goes: given a set of binary jar files, it builds an object graph of the bytecode: packages, classes, methods and so on, complete with interdependencies and other handy information. The bytecode is also run through a decompiler and pretty-printed to normalize the source. Code segments are attached and indexed alongside their owners (class or method). All this fully indexed, searchable and cross-referenced. This is built upon the same engine used by ZOE, so the end result is very much along the lines of what ZOE does for email, but for code instead... fun, fun, fun ;) Cheers, PA.
Re: Lucene app to index Java code
Hi Erik, On Thursday, Sep 4, 2003, at 15:03 Europe/Zurich, Erik Hatcher wrote: - XDoclet could be used to sweep through Java code and build a text/XML file as richly as you'd like from the information there (complete with JavaDoc tags, which Zapata will miss :)) Correct. This happens to be on purpose :) Does XDoclet build an intertwingled object graph of your code along the way? Performing a plain search on a code base is pretty trivial... what seems to be more interesting is to put that in context. Zapata does something along the lines of what MagicHat does for Objective-C: http://homepage.mac.com/petite_abeille/MagicHat/ But from the sound of what Otis is saying, this is not what you guys are looking for... back to the pampa then... Cheers, PA.
Re: StandardTokenizer problem
On Thursday, Sep 4, 2003, at 16:07 Europe/Zurich, Nicolas Maisonneuve wrote: I.B.M can be a host or an acronym, so there is a problem, no? Perhaps, as far as this parser goes... but... in practice... '.M' is not a valid TLD. PA.
Re: IndexReader.delete(Term)?
Hi Erik, On Wednesday, Aug 27, 2003, Erik Hatcher wrote: What you are doing looks fine to me. I'm sure these are obvious questions, kinda like is your computer plugged in?, but here goes: - How are you determining that the document is still there? With an IndexReader? IndexSearcher? - A freshly created (i.e. after the delete) Index[Searcher|Reader]? - And finally, did you remember to recompile?! :)) (just kidding) Thanks for the moral support :) In any case, hard liquor and coding don't always mix well, so obviously I was shooting myself in the foot... For the record, I'm using a RAMDirectory which then gets flushed into an FSDirectory. Deleting something means checking both the RAM and FS directories. Which is what I do. But... because of the internal caching done by the IndexWriter, a document is not made available straight away... therefore IndexReader.delete(Term) was returning zero and I was banging my head against the wall... adjusting the order of operations did solve the problem... Which brings a question: is there a way to influence the IndexWriter's internal RAM cache, besides closing or optimizing the writer? Cheers, PA.
IndexReader.delete(Term)?
Hello, This is more a sanity check than anything else, but... I'm trying to delete a document using IndexReader.delete(Term)... (for the record, I'm using 1.3-rc1) The document was created with a Field.Keyword() to uniquely identify it. The document exists, was saved, can be queried, life is good :) But then... when trying to delete the same document later on... IndexReader.delete(Term) returns 0 and the document doesn't get deleted... which is driving me crazy 8} Here is what I'm doing to delete the document: Term aTerm = new Term( aKey, anID ); aReader.delete( aTerm ); aReader.close(); The term looks like the following: szid:3FA7168800F7FDE8ECAA35500A00012D But this doesn't seem to do anything... The document is still there no matter what... I'm sure I'm doing something very wrong, but for the life of me I cannot see what... anything obvious I'm missing? Thanks. Cheers, PA.
Advanced Text Indexing with Lucene
Another fine article by Otis: http://www.onjava.com/pub/a/onjava/2003/03/05/lucene.html PA.
Re: Indexing Tips and Hints
On Tuesday, Feb 25, 2003, at 11:48 Europe/Zurich, Andrzej Bialecki wrote: This is strange, or at least counter-intuitive - if you buffer larger parts of data in RAM than the standard implementation does, it should definitely be faster... Let's wait and see what Terry comes up with. BTW, how large were the indexes you used for testing? A small testing set: around 100 MB. Also, it could be that the indexing process is bound by some other bottleneck, Most definitely. and buffering helps only when searching an already existing index. Oops... forgot to mention that the purpose of my testing was to test searching... I don't mind indexing speed that much... in any case... more generally, I wanted to see if a buffered random access file would help in my peculiar situation... but there were no noticeable differences in my case one way or another... on the other hand... that could be just me, as there is much more going on than straightforward Lucene indexing/searching. Let that not discourage you :-) In any case, Lucene itself is pretty speedy overall. The only bottleneck is index merging, in my experience. Cheers, PA.
Re: Best HTML Parser !!
On Monday, Feb 24, 2003, at 20:28 Europe/Zurich, Lukas Zapletal wrote: I have some good experiences with JTidy. It works like a DOM XML parser and cleans the HTML along the way. I use JTidy also, both for parsing and clean-up. Works pretty nicely. This is VERY useful, because EVERY HTML page has at least ONE error. This rule should be tattooed on every parser's head: out of the laboratory, nothing is compliant. Which renders the race to more compliance among the different parsers somewhat ridiculous. Cheers, PA.
read past EOF?
Hello, Here is a pretty fatal exception I get from time to time in Lucene... java.io.IOException: read past EOF at org.apache.lucene.store.FSInputStream.readInternal(FSDirectory.java:277) at org.apache.lucene.store.InputStream.readBytes(Unknown Source) at org.apache.lucene.index.SegmentReader.norms(Unknown Source) at org.apache.lucene.index.SegmentReader.norms(Unknown Source) at org.apache.lucene.search.TermQuery.scorer(Unknown Source) at org.apache.lucene.search.BooleanQuery.scorer(Unknown Source) at org.apache.lucene.search.Query.scorer(Unknown Source) at org.apache.lucene.search.IndexSearcher.search(Unknown Source) at org.apache.lucene.search.Hits.getMoreDocs(Unknown Source) at org.apache.lucene.search.Hits.<init>(Unknown Source) at org.apache.lucene.search.Searcher.search(Unknown Source) at org.apache.lucene.search.Searcher.search(Unknown Source) Any idea what could cause such, er, misbehavior? PA.
Re: Heuristics on searching HTML Documents ?
On Monday, Dec 30, 2002, at 15:01 Europe/Zurich, Erik Hatcher wrote: If you have control over the HTML, how about marking the navbar pieces with a certain CSS class and then filtering that out from what you index? It seems like that would be a reasonable way to filter it - but this is of course provided it's your HTML and not someone else's. Alternatively, if the documents' creation is out of your hands, you could try to compute the longest common prefix/suffix of a set of documents and discount that from your indexing. PA.
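The common prefix/suffix idea above can be sketched in a few lines of plain Java. This is an illustrative snippet (class and method names made up here); the same logic applied to reversed strings yields the common suffix.

```java
import java.util.List;

// Sketch of the "discount boilerplate" idea: compute the longest common
// prefix shared by a set of extracted pages, then strip it before indexing.
public class BoilerplateTrimmer {

    public static String commonPrefix(List<String> docs) {
        if (docs.isEmpty()) return "";
        String prefix = docs.get(0);
        for (String doc : docs) {
            int i = 0;
            int max = Math.min(prefix.length(), doc.length());
            // shrink the candidate prefix to what this document shares
            while (i < max && prefix.charAt(i) == doc.charAt(i)) i++;
            prefix = prefix.substring(0, i);
        }
        return prefix;
    }

    public static String stripPrefix(String doc, String prefix) {
        return doc.startsWith(prefix) ? doc.substring(prefix.length()) : doc;
    }
}
```

In practice you would want to be a little smarter (e.g. cut at a line boundary so a shared first word of real content does not get eaten), but the principle is just this.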
Re: powered by lucene question
On Friday, Dec 27, 2002, at 18:22 Europe/Zurich, Otis Gospodnetic wrote: It would be nice to make that Lucene image clickable, which should be a piece of cake, since Zoe uses HTML for rendering the UI. Doable? Well... yes. This is how it works in the application itself: you can click on the Lucene logo... The screenshot was simply to give you a preview of what to expect... Am I missing something?-) PA.
Re: write.lock file
On Friday, Dec 20, 2002, at 19:48 Europe/Zurich, Doug Cutting wrote: Can you provide a reproducible test case that demonstrates index corruption? I honestly wish I could. Unfortunately, because of the nature of the application (Otis is familiar with it), I never seem to be able to come up with a consistent test case. I might be using Lucene in a very peculiar way (?) which includes a lot of concurrent reads/writes/deletes on multiple indexes/threads. So I haven't managed to capture the problem in a nice and tidy batch-oriented test case. Sigh... In any case, the external symptoms are: bad file descriptor, read past EOF, and array out of bounds. Next time around, I will capture the full stack trace and forward it to the list if you guys are interested. Cheers, PA.
powered by lucene question
Hello, I'm in the process of creating the about page for my app and I was wondering what the requirements are to get included in the Powered by Lucene page. The app is a desktop application... it's not a web site. The only requirement I see is Please include something like the following with your search results: search powered by lucene. My question is: is it good enough to put the above in the about page, or does it _have_ to be in the search results page? I'm fine with the first scenario (about page) but I'm very reluctant about the second (search page). This reluctance has nothing to do with branding or failure to give Lucene credit: it's simply that the search page is already very crowded and I don't think adding anything to it will improve the situation :-| On the contrary :-( Comments? Thanks. PA.
package information?
Hi, Would it be possible for Lucene to provide package information? Basically all the java.lang.Package attributes... things like implementation vendor, name, version and so on... This would make it easier to identify which packages/versions are used. Thanks. PA.
Re: package information?
On Friday, Dec 20, 2002, at 21:44 Europe/Zurich, Eric Isakson wrote: I think this info is available via the Manifest that is created during the build. This is cut from the build.xml from the latest CVS... Great! I must have overlooked it somehow. Thanks. PA.
Re: write.lock file
On Tuesday, Dec 17, 2002, at 17:43 Europe/Zurich, Doug Cutting wrote: Index updates are atomic, so it is very unlikely that the index is corrupted, unless the underlying file system itself is corrupted. Ummm... perhaps in theory... In practice, indexes seem to get corrupted quite easily in my experience. On the other hand, I seldom get file system corruption. As always, YMMV. PA.
Re: Indexing in a CBD Environment
On Wednesday, Dec 11, 2002, at 15:21 Europe/Zurich, Cohan, Sean wrote: Is there a better way to provide an acceptable searching mechanism using the relational database engine? Well, it depends on what you mean by acceptable... but if you are using Oracle, you should look into Oracle Text: http://otn.oracle.com/products/text/content.html http://www.searchtools.com/tools/oracle-search.html PA.
Re: Indexing in a CBD Environment
On Wednesday, Dec 11, 2002, at 07:16 Europe/Zurich, Otis Gospodnetic wrote: It uses Lucene as an object store, of sorts, I believe, with various relations between objects (I did not look at the source, but I suspect it does this based on the functionality it offers). Yep. The basic approach ZOE takes is to create one index per class and index the primary and foreign keys as keywords. It then queries the different indexes to simulate relational storage... Which is all handy-dandy... On the other hand, if you already have a relational database, there is no reason to go through this circus in the first place... You may want to look at its source. If you are so inclined, you can check the alt.dev.szobject package for more gory details. In particular, SZIndex deals with Lucene directly. You can find the app and its source here: http://guests.evectors.it/zoe/ Cheers, PA.
Re: Indexing email messages?
On Friday, Dec 6, 2002, at 11:12 Europe/Zurich, Ashley Collins wrote: I'm using Lucene to index MIME messages and have a couple of questions. You should take a look at ZOE, as it does all that and more. It's open source and uses Lucene to index every single bit of email. http://guests.evectors.it/zoe/ Cheers, PA.
Re: Readability score?
On Friday, Nov 22, 2002, at 20:46 Europe/Zurich, petite_abeille wrote: Does anyone have a handy library to compute a readability score? Here is an extract from a paper describing the Flesch index and an algorithm to count syllables... Does that make any sense? Thanks. The Flesch index: An easily programmable readability analysis algorithm -- John Talburt ... Each vowel (a, e, i, o, u, y) in a word counts as one syllable, subject to the following sub-rules: Ignore final -ES, -ED, -E (except for -LE). Words of three letters or less count as one syllable. Consecutive vowels count as one syllable. Although there are many exceptions to these rules, it works in a remarkable number of cases. ... http://portal.acm.org/citation.cfm?id=10583&coll=portal&dl=ACM&CFID=5876721&CFTOKEN=58538732
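The Talburt rules quoted above translate almost directly into Java. This is a rough sketch (class name and the sanity-check values are mine, not from the paper); the Flesch Reading Ease formula itself is the standard 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words).

```java
// A rough rendering of the syllable-counting rules quoted above.
// Heuristic only: English has plenty of exceptions, as the paper admits.
public class Syllables {

    private static boolean isVowel(char c) {
        return "aeiouy".indexOf(Character.toLowerCase(c)) >= 0;
    }

    public static int count(String word) {
        String w = word.toLowerCase();
        // Rule: words of three letters or less count as one syllable.
        if (w.length() <= 3) return 1;
        // Rule: ignore final -es, -ed, -e (except for -le).
        if (w.endsWith("es") || w.endsWith("ed")) {
            w = w.substring(0, w.length() - 2);
        } else if (w.endsWith("e") && !w.endsWith("le")) {
            w = w.substring(0, w.length() - 1);
        }
        int syllables = 0;
        boolean prevVowel = false;
        for (int i = 0; i < w.length(); i++) {
            boolean vowel = isVowel(w.charAt(i));
            // Rule: consecutive vowels count as one syllable.
            if (vowel && !prevVowel) syllables++;
            prevVowel = vowel;
        }
        return Math.max(syllables, 1);
    }

    // Standard Flesch Reading Ease score (higher = easier to read).
    public static double flesch(int sentences, int words, int syllables) {
        return 206.835
                - 1.015 * ((double) words / sentences)
                - 84.6 * ((double) syllables / words);
    }
}
```

Quick sanity checks: count("table") gives 2 (the -le ending is kept), count("stones") gives 1 (final -es dropped, "o" and the consecutive-vowel rule do the rest), count("the") gives 1 (short-word rule).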
Readability score?
Hello, This is slightly off-topic, but... does anyone have a handy library to compute readability scores? Something like the Flesch Reading Ease score & Co: http://thibs.menloschool.org/~djwong/docs/wordReadabilityformulas.html Would you like to share?-) Thanks. R.
using lucene as a lookup table?
Hello, I would like to use Lucene as a kind of lookup table (aka Map): a document would have two fields: - the first field would represent a random lookup key in the form of a Field.Keyword - the second field would be an object id, also stored as a Field.Keyword Which sounds fine in theory. Unfortunately, it doesn't seem to quite work in practice: when inserting a new document and trying to look it up straight away, I usually don't get any result back for a while. Maybe I'm simply missing something very obvious, but how does one look up a document that was just inserted in an index? Thoughts? Thanks. PA.
Re: using lucene as a lookup table?
On Friday, Sep 27, 2002, at 13:27 Europe/Zurich, petite_abeille wrote: - the first field would represent a random lookup key in the form of a Field.Keyword Oops... I should have mentioned that the key field is stored as Field( aKey, aValue, false, true, false): i.e. not stored, indexed, not tokenized. It's basically only indexed, as I don't need its value for lookup purposes. PA.
Re: text format and scoring
Hi Alex, On Saturday, August 3, 2002, at 11:13, Alex Murzaku wrote: Hi PA! How are things going? Doing all right :-) It's an interesting question, but I don't think Lucene (as it is today) could change weights based on semantics (either assigned by formatting tags or maybe looked up in some dictionary like WordNet)... Ummm... I see. Some time ago, Doug sent to this list the formula for the score computation, which is: Thanks. The only thing that counts is the frequency of the terms in the document and among documents. A way to influence the final score might be to tweak the real frequencies during indexing with some parameters configured externally. Let's say, if the word is underlined, then multiply its count by X. This modified TF should influence the final score accordingly. Just a thought... I see. That's basically what I'm doing right now, somehow: I index a document multiple times (e.g. an email could be indexed by subject, first sentence and body content). Then I do multiple searches, and use a ranking comparator to evaluate the results based on how many times I get a specific document, plus its Lucene scores and other funky heuristics. Which seems to work OK, but is kind of cumbersome :-( Same deal for finding related documents. Lucene is very good at finding similar documents, but for related ones (think cluster ;-), I basically end up doing some term categorization and assigning a multiplying factor to each term category. Which I then feed to Lucene to get something more akin to a cluster of documents... In any case, I was simply wondering if there was a more straightforward way of doing things. Cheers, PA.
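The multi-search ranking described above can be sketched in plain Java. This is a hypothetical illustration of the general idea (the class, the document-id strings, and the score maps are stand-ins, not ZOE's actual code or Lucene's Hits API): run several searches, then order documents first by how many searches returned them, breaking ties by accumulated score.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the "ranking comparator" approach: each search produces a
// map of document id -> score; documents seen by more searches rank
// higher, with accumulated score as the tie-breaker.
public class MultiSearchRanker {

    public static List<String> rank(List<Map<String, Float>> searchResults) {
        final Map<String, Integer> hitCounts = new HashMap<String, Integer>();
        final Map<String, Float> totalScores = new HashMap<String, Float>();
        for (Map<String, Float> hits : searchResults) {
            for (Map.Entry<String, Float> hit : hits.entrySet()) {
                String id = hit.getKey();
                Integer n = hitCounts.get(id);
                hitCounts.put(id, n == null ? 1 : n + 1);
                Float s = totalScores.get(id);
                totalScores.put(id, (s == null ? 0f : s) + hit.getValue());
            }
        }
        List<String> ids = new ArrayList<String>(hitCounts.keySet());
        Collections.sort(ids, new Comparator<String>() {
            public int compare(String a, String b) {
                int byCount = hitCounts.get(b) - hitCounts.get(a); // more hits first
                if (byCount != 0) return byCount;
                return Float.compare(totalScores.get(b), totalScores.get(a));
            }
        });
        return ids;
    }
}
```

The "other funky heuristics" mentioned above would slot into the comparator; weighting each search differently (subject hits worth more than body hits, say) is a one-line change to the accumulation step.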
text format and scoring
Hello, I was wondering what would be a good way to incorporate text format information into Lucene word/document scoring. For example, when turning HTML into plain text for indexing purposes, a lot of potentially useful information is lost: e.g. tags like bold, strong and so on could be understood as conveying emphasis about some words. If somebody took the pains to underline some words, why throw that away? Assuming there is some interesting meaning in a document's format/layout, and a way to understand and weight it, how could one incorporate this information into document scoring? Thanks for any insights :-) PA.
Lucene for OSX?
Hello, I was wondering if anybody knows of a Lucene port to straight C or Objective-C...?!? I need something equivalent to Lucene (but native if possible) on Mac OS X... Thanks for any pointers!-) PA.
Re: Lucene for OSX?
On Tuesday, July 16, 2002, at 03:41, Otis Gospodnetic wrote: The only thing that I can think of right now is omseek on sf.net, but that project seems somewhat dead. I think that is in C or C++. Thanks. I also found something called Onix (http://www.lextek.com/onix/) Anybody have any experience with it? And how does it compare to Lucene? Thanks. PA. -- To unsubscribe, e-mail: mailto:[EMAIL PROTECTED] For additional commands, e-mail: mailto:[EMAIL PROTECTED]
Re: Lucene for OSX?
Hi James, On Tuesday, July 16, 2002, at 03:52, Brook, James wrote: How about this? I think it's what they use for Sherlock. Apple Information Access Toolkit (AIAT) http://www.devworld.apple.com/dev/aiat/ Well, that's basically the first incarnation of Lucene :-) And in fact I was thinking of using it. However, it seems to be missing from the latest OS X... If you know otherwise, where is it hidden then? I have an Objective C WebObjects 4.5 application running on Mac OS X Server 1.2 that uses it to directly search the blobs of an OpenBase database. Wow... I briefly used AIAT myself... but as I said, I cannot find it anymore. I believe that it is written in C++, but it can easily be wrapped. Thanks. PA. -- To unsubscribe, e-mail: mailto:[EMAIL PROTECTED] For additional commands, e-mail: mailto:[EMAIL PROTECTED]
Re: Lucene for OSX?
On Tuesday, July 16, 2002, at 04:04 , Brook, James wrote: It looks like it's available for FTP download as an 'SDK' on this page http://developer.apple.com/sdk/ I have no idea whether this is up-to-date or compatible with the latest OS X. Thanks. I will take a look into it. Cheers, PA. -- To unsubscribe, e-mail: mailto:[EMAIL PROTECTED] For additional commands, e-mail: mailto:[EMAIL PROTECTED]
[OT] Zoe open source
Hello, I'm releasing Zoe under the Apple Public Source License and putting together a SourceForge project to coordinate the future development of Zoe. Our plan is to choose a handful of experienced developers to form the core development team for Zoe. Anyone is free to contribute code, which members of the development team will review and add back into the codebase. Over time we will invite developers who have demonstrated their interest and abilities to join our core team. We'll keep the mailing lists public and encourage everyone to sign up and throw in any comments they may have. There has been a tremendous amount of interest in having Zoe released as an open source project and we think that this will be the best way to manage all the different voices. Right now we need to know which of you are interested in contributing code to Zoe and how interested you are. If you are interested in being on the core team of developers and helping Zoe become the best e-mail client out there, we want to hear from you. Members of the core team will be expected to watch the mailing lists, regularly contribute to the codebase, and review and integrate contributions by other developers. If you would like to be considered as a member of the core team, we would like you to e-mail a resume, if you have one, some code snippets, and a note letting us know what you are interested in doing with Zoe, i.e. is there a particular set of tools you would like to implement or enhance. Email these to Kate at [EMAIL PROTECTED] She will be helping me out and coordinating the SourceForge project for Zoe. Thanks. PA. -- To unsubscribe, e-mail: mailto:[EMAIL PROTECTED] For additional commands, e-mail: mailto:[EMAIL PROTECTED]
Re: [OT] Zoe open source
On Monday, June 3, 2002, at 04:44 PM, Peter Carlson wrote: Good luck with your project. It looks very exciting and refreshing. I haven't tried it yet, but the screen shots look useful and beautiful. Thanks. I hope that you will stay active in the Lucene user community and contribute any new features in Lucene back into the core or sandbox projects. Sure. There is already a kind of generic persistency layer built on top of Lucene. People interested can take a look at the alt.dev.szobject package under Frameworks. Cheers, PA. -- To unsubscribe, e-mail: mailto:[EMAIL PROTECTED] For additional commands, e-mail: mailto:[EMAIL PROTECTED]
[OT] An Open Letter
FYI. Begin forwarded message: From: Alex Horovitz [EMAIL PROTECTED] Date: Mon May 27, 2002 01:58:27 PM Europe/Zurich To: [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED] Cc: Steve Jobs [EMAIL PROTECTED], [EMAIL PROTECTED], Toni Trujillo-Vian [EMAIL PROTECTED], Bob Fraser [EMAIL PROTECTED], WebObjects [EMAIL PROTECTED] Subject: [OT] An Open Letter An open letter to Apple on why many people want an open source WebObjects and EOF. --- Reader's Digest Version --- Four reasons why Apple should open source WO/EOF: REASON #1: WO/EOF cannot be legitimate extensions of the Apple brand; their value to the marketplace is only achieved through independence from the Apple brand proper. Placing WO/EOF under an open source license allows Apple to retain control. It also allows legitimacy to and adoption by those who would not normally accept or adopt an Apple product in this space. REASON #2: Because debugging is highly parallelizable, an open source WO/EOF will increase the number of debuggers and therefore increase the stability of the product over the long run by applying the skills of many more engineers than Apple could ever hope to support as employees. With a large enough user/co-developer community, all bugs can be quickly quantified and understood, allowing a fix to become obvious to at least one member of the community. REASON #3: If Apple will treat the WO/EOF user community as if we were their most valuable resource in terms of current and future development of the product, we will become their most valuable resource. Trusting us enough to share the source in an open source fashion will benefit Apple (and the application server market) in ways they cannot even begin to imagine. REASON #4: Because there is no accounting for taste. That was the first lesson of applied microeconomics my college professor taught me, and it holds true today. 
Apple, as smart and cutting edge as it may be, cannot anticipate the ways in which WO/EOF will be utilized or improved upon by people in the field. Open source allows for faster innovation and the ability to capture truly useful and novel ideas. --- Unabridged Version --- The first question Apple must address is one of business sense. Does it make good business sense to open source any technology, let alone WO/EOF? We have some evidence that at least in one case, Darwin, it made sense to open source a key Apple technology. Now granted, this is an attempt to position Mac OS X against Linux in some key market segments. That being said, a case was effectively made and bought off on by key Apple people. Can we do the same for WebObjects? Sure we can. WebObjects is not a clear legitimate extension of the Apple brand. I suspect that everyone knows this to be true. I also suspect that this gives Apple some pause in terms of being able to evangelize/market the product at the level which would allow it to attain a respectable position in the application server market. And, before you say that a large company like Apple can't really afford to open source a software project like WO/EOF, consider that IBM has done it for WebSphere. An open sourced WO/EOF could avoid the traditional problems Apple faces in the area of brand extension. This is because Apple Engineering enjoys legitimacy as an outstanding software organization. As technologies, WO/EOF both enjoy reputations for being excellent products. However, in terms of adoption, they suffer due to the disconnect between the enterprise application server market and Apple's traditional self branding. REASON #1: WO/EOF cannot be legitimate extensions of the Apple brand; their value to the marketplace is only achieved through independence from the Apple brand proper. Placing WO/EOF under an open source license allows Apple to retain control. 
It also allows legitimacy to and adoption by those who would not normally accept or adopt an Apple product in this space. From experience, we all know there has _never_ been a bug-free release of WO/EOF. Apple's WO/EOF customers face the same challenge in this respect: given the new release, what bugs will it have that will prevent me from moving to that release; and, what bugs in my current release does it fix that would encourage me to move to that release. Also from experience, we know there to be a significant time between releases. The non-open source development style is the culprit here. Apple is passionate about releasing good, stable software. Before a product can go out the door there is an extensive amount of QA and testing. This being the case, and with a goal of minimizing shipped bugs and maximizing stability of releases, it takes time to get to a point where collectively Apple feels it can ship the product. The experience of the open source community is quite the opposite.
source code available
For entertainment purpose only, ZOË's source code is available at: http://guests.evectors.it/zoe/ PA. -- To unsubscribe, e-mail: mailto:[EMAIL PROTECTED] For additional commands, e-mail: mailto:[EMAIL PROTECTED]
Re: indexing PDF files
On Wednesday, May 1, 2002, at 05:41 PM, Otis Gospodnetic wrote: Wouldn't you want to convert to XML instead and use XSLT to transform the XML representation to any desired format by just applying a style sheet? Sounds like less work with bigger document type coverage. Sounds good... But what does it mean? I'm not that familiar with any of the XML/XSLT hype so I don't really understand what you are getting at... I just want to convert any type of document to text for indexing purposes... I'm not planning to do anything else with it... However, converting everything to PDF as a first step allows you to provide a preview of any document even if you happen not to understand the original format (eg MS Office)... PA -- To unsubscribe, e-mail: mailto:[EMAIL PROTECTED] For additional commands, e-mail: mailto:[EMAIL PROTECTED]
Re: indexing PDF files
On Friday, May 3, 2002, at 03:16 PM, Moturu, Praveen wrote: Can I assume none of the people on the lucene user group had implemented indexing a pdf document using lucene. Who knows...?!? In any case, it's not public knowledge... If some one has.. Please help me by providing the solution. I used to believe in Santa Claus also... ;-) All that said, there seems to be a real demand to do something about pdf to text conversion (in java preferably). I'm willing to invest some time and brain cells to nail it down, but I'm not sure where to start... I'm aware of the PJ library, but it's really a pig as far as resources go. Anything else? Any (concrete) pointers appreciated. Thanks. PA. -- To unsubscribe, e-mail: mailto:[EMAIL PROTECTED] For additional commands, e-mail: mailto:[EMAIL PROTECTED]
Re: Homogeneous vs Heterogeneous indexes (was: FileNotFoundException)
On Wednesday, May 1, 2002, at 12:41 AM, Dmitry Serebrennikov wrote: - the number of files that Lucene uses depends on the number of segments in the index and the number of *stored* fields - if your fields are not stored but only indexed, they do not require separate files. Otherwise, an .fnn file is created for each field. Ok. That's good as all my fields are indexed but not stored in Lucene. Only one field is stored in any one index: the uuid of an object (as a Keyword). - if at least one document uses a given field name in an index, that index requires the .fnn file for that field Ok. So, in theory, a more homogeneous index should use fewer files, all things being equal? - index segments are created when documents are added to the index. For each 10 docs you get a new segment. - optimizing the index removes all segments and replaces them with one new segment that contains all of the documents - optimization is done periodically as more documents are added (controlled by IndexWriter.mergeFactor), but can be done manually whenever needed Ok. When doing the optimization, are there any temporary files getting created? With all this, I think Lucene does use too many files... That's my impression also... Some additional info: there is a field on IndexWriter called infoStream. If this is set to a PrintStream (such as System.out), various diagnostic messages about the merging process will be printed to that stream. Yep. I guess I overlooked that. You might find this helpful in tuning the merge parameters. Just to make sure: will using a small merge factor (eg 2) reduce the number of files, or just optimize (aka merge) the index more often? Hope this helps. Good luck. Thanks. Very helpful indeed :-) R. -- To unsubscribe, e-mail: mailto:[EMAIL PROTECTED] For additional commands, e-mail: mailto:[EMAIL PROTECTED]
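The segment arithmetic discussed above can be made concrete with a toy model (this is a simulation for intuition, not Lucene's actual code): a new on-disk segment appears roughly every `minMergeDocs` added documents, and whenever `mergeFactor` segments of the same size accumulate they are merged into one larger segment. Fewer, larger segments mean fewer index files on disk:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;

// Toy model of 1.x-era Lucene merging: segments of equal size are
// merged in groups of `mergeFactor`, cascading upward like a counter
// in base mergeFactor. Illustrative only; real Lucene differs in detail.
public class MergeModel {
    public static int segmentCount(int docs, int minMergeDocs, int mergeFactor) {
        List<Integer> segments = new ArrayList<>();  // segment sizes in docs
        for (int d = minMergeDocs; d <= docs; d += minMergeDocs) {
            segments.add(minMergeDocs);
            boolean merged = true;
            while (merged) {  // cascade merges until stable
                merged = false;
                for (int size : new HashSet<>(segments)) {
                    if (Collections.frequency(segments, size) >= mergeFactor) {
                        for (int i = 0; i < mergeFactor; i++)
                            segments.remove(Integer.valueOf(size));
                        segments.add(size * mergeFactor);
                        merged = true;
                    }
                }
            }
        }
        return segments.size();
    }

    public static void main(String[] args) {
        // mergeFactor 10: 1000 docs collapse into a single segment.
        System.out.println(segmentCount(1000, 10, 10)); // 1
        // mergeFactor 2: merges happen far more often, and at any
        // moment few segments (and thus few files) are live.
        System.out.println(segmentCount(1000, 10, 2));  // 3
    }
}
```

This matches the answer implied in the thread: a small merge factor both merges more often and keeps the steady-state number of segments (hence files) low, at the cost of more merge I/O.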
Re: indexing PDF files
On Tuesday, April 30, 2002, at 10:46 PM, Otis Gospodnetic wrote: Hm, this should be a FAQ. Maybe it should... ;-) Check Lucene contributions page, there are some starting points there, Well, this seems to be a very popular request... In fact I need something like that also. Unfortunately, there seems to be no authoritative answer as far as converting pdf files to text in a pure Java environment goes... Maybe I'm missing something here as usual? Also, on a related note, what would be a good approach to convert any random document into pdf? I was thinking of having a two-step process for document indexing in Lucene: - First, convert everything to pdf (with Acrobat or something) - Second, convert pdf to text and index it. Any practical suggestions about how to do that in a pure Java environment are very welcome. Thanks :-) PA. -- To unsubscribe, e-mail: mailto:[EMAIL PROTECTED] For additional commands, e-mail: mailto:[EMAIL PROTECTED]
Re: Homogeneous vs Heterogeneous indexes (was: FileNotFoundException)
On Tuesday, April 30, 2002, at 01:57 AM, Steven J. Owens wrote: Just be glad you aren't doing this on Solaris with JDK 1.1.6 I know... In fact I'm looking forward to porting my stuff to 1.4... As my app is very much IO bound, I'm really excited by this nio madness... :-) Yes and no. Setting ulimit to a reasonable number of open files is not only not a patch, it's the right way to do it. Of course... Nothing is really black or white... What I wanted to say is that -as a first strike- *I* prefer not to mess around with system parameters. I understand where you're coming from, really, and in a certain way, it makes sense Thanks. I already feel less alone... ;-) BUT... sometimes the impulse for clean, good design takes you too far down a blind alley. Sure. At the end of the day, everything is a tradeoff... Sometimes there is no elegant solution. Sometimes there is no best way, only one of a limited set of options with different tradeoffs. Absolutely. Most serious applications have to have some sort of OS variable tweaking, you're just used to having it done invisibly and painlessly. Agreed. In fact this is my first desktop application in nearly a decade. I usually work on large scale systems. And let me tell you, it's a very different pair of sleeves... ;-) You could figure out the right way to set the system configuration on install or launch. One of my design goals is to try to avoid these sorts of tweakings as much as I can. You could look at the alternative techniques for indexing in Lucene That's another one of those nasty tradeoffs... ;-) Memory is even more precious than file descriptors in my situation... Especially with a jvm that has this funky notion of constraining your memory usage... if there's anything you're doing wrong (perhaps opening files and not closing them, and leaving them for the garbage collector to eventually get around to closing?) Sure. I went through all those sanity checks. 
Also, in my case, the garbage collector is my friend as I'm using the java.lang.ref API extensively. or if you have a pessimal usage pattern that exacerbates the situation. U...?!? You lost me here... What's a pessimal usage pattern? if you can come up with a scheme to run Lucene indexing with modified code for keeping track of file resources. Sure, there are many things that one could do... However, I have to balance how much time I want to invest into any one of those alleys. One thing I really like about Lucene is its very simple API and usage. So far it has worked out pretty well for me as I'm using it pretty extensively. And I seem to have found -at last- a good balance between the different constraints I'm operating under. an anomalous situation (use on a client/desktop machine) Anomalous situation?!?! Ummm... Lucene is just an API... Hopefully it's not bundled with some dogma attached to it... However, I'm kind of starting to wonder about that considering some of the -very defensive- responses I got to my postings... Oh, well... I will just go back to my cave... :-( could configure lucene to be careful about how many files it keeps open at any given time. That will be great! On a somewhat related note, I have decided to stick with the com.lucene package for the time being. I was pretty excited when the rc stuff came out, but it just didn't work out for me. My resources problem just went from bad to worse. And also, I have two issues with the release candidate: locking and reference counting. Locking. I don't have anything against locking per se. However, I really don't like how it's implemented in the rc. Using files just does not work for me. It creates too many problems when something goes wrong (eg the app is killed without warning and I have to clean up all those locks by myself). What about using sockets or something to rendez-vous on an index? Or at a bare minimum, be able to disable the locking altogether. 
I understand that most people are using Lucene under a very different setup than I do, but nevertheless it should not hurt to make it configurable. Anyway, it does not work for older jvms as noted in the source code. Last, but not least, I always get very scared when I see some platform dependent code somewhere (eg if version 1 then ) ;-) Reference counting. Well, as noted in a comment in the source code, the reference API is really the way to go... And trying to be backward compatible with version 0.9 is somehow missing the forest for the trees... Just my two cents in any case. And yes, I'm well aware that I could fix all these issues by myself... And start to contribute to Lucene instead of just ranting left and right... But also keep in mind that I'm just a humble Lucene user. And there seems to be a very clear distinction between user and developer in Lucene's world... ;-) Thanks for your response in any case. I hope I didn't offend too many people with my ramblings ;-) PA. -- To unsubscribe, e-mail: mailto:[EMAIL PROTECTED] For additional commands, e-mail: mailto:[EMAIL PROTECTED]
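The java.lang.ref-based caching mentioned in this message can be sketched as follows — a map whose values are held through SoftReferences, so the garbage collector may reclaim entries under memory pressure instead of the application sizing the cache by hand. All names here are illustrative, not taken from the poster's actual code:

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

// Minimal soft-reference cache: get() returns null both for keys
// never inserted and for entries the GC has reclaimed, so callers
// must be prepared to rebuild the value either way.
public class SoftCache<K, V> {
    private final Map<K, SoftReference<V>> map = new HashMap<>();

    public void put(K key, V value) {
        map.put(key, new SoftReference<>(value));
    }

    public V get(K key) {
        SoftReference<V> ref = map.get(key);
        return ref == null ? null : ref.get();  // null if collected
    }

    public static void main(String[] args) {
        SoftCache<String, String> cache = new SoftCache<>();
        cache.put("index", "reader-handle");
        System.out.println(cache.get("index"));
    }
}
```

The tradeoff the poster alludes to is real: soft references trade deterministic resource release for memory-pressure-driven eviction, which is why they pair badly with objects that pin file descriptors unless those are closed explicitly.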
Re: rc4 and FileNotFoundException: an update
I don't know what environment you're using Lucene in. However, we had this too many open files problem on our Solaris box, and increasing the number of file descriptors through the ulimit -n command fixed it. Thanks. That should help. However, I have a little desktop app and it will be very cumbersome to require users to change some system parameters just to run it... :-( Thanks in any case. PA -- To unsubscribe, e-mail: mailto:[EMAIL PROTECTED] For additional commands, e-mail: mailto:[EMAIL PROTECTED]
Re: too many open files in system
On Tuesday, 9. April 2002 14:08, you wrote: root wrote: Doesn't Lucene release the file handles?? Because I get too many open files in system after running lucene a while! Are you closing the readers and writers after you've finished using them? cheers, Chris Yes, I close the readers and writers! By the way, did you ever solve this problem? I went through that thread and everybody seems to be passing the buck to somebody else... :-( PA. -- To unsubscribe, e-mail: mailto:[EMAIL PROTECTED] For additional commands, e-mail: mailto:[EMAIL PROTECTED]
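The leak pattern discussed in this thread is not Lucene-specific: any reader or writer left unclosed pins a file descriptor until the garbage collector happens to finalize it. Closing in a finally block releases the descriptor deterministically. Shown here with a plain FileWriter as a stand-in; the same shape applies to Lucene's IndexReader/IndexWriter:

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

// Close-in-finally: the descriptor is released even when write() throws,
// instead of lingering until finalization.
public class CloseInFinally {
    public static void write(File f, String text) throws IOException {
        FileWriter out = new FileWriter(f);
        try {
            out.write(text);
        } finally {
            out.close();  // runs on both normal and exceptional exit
        }
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("demo", ".txt");
        write(f, "hello");
        System.out.println(f.length()); // 5
        f.delete();
    }
}
```

On the 1.4-era JVMs discussed here this try/finally idiom was the only option; relying on the GC to close handles is exactly what produces sporadic "too many open files" errors under load.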
Re: FileNotFoundException: code example
I would add some logging to the code You lost me here... Where should I add some logging? to get more idea of which Lucene methods are actually being called, when, in what sequence. A typical sequence looks like this: - search() - deleteIndexWithID() - indexValuesWithID() PA -- To unsubscribe, e-mail: mailto:[EMAIL PROTECTED] For additional commands, e-mail: mailto:[EMAIL PROTECTED]
Re: too many open files in system
how many open files do you think can be used by your process?? Not sure. It varies with usage pattern. I will check it out in any case. cat /proc/sys/fs/file-max cat: /proc/sys/fs/file-max: No such file or directory echo 5 > /proc/sys/fs/file-max Unfortunately, I cannot use this kind of quick fix as my app is a desktop app and has access to the user account only. PA. -- To unsubscribe, e-mail: mailto:[EMAIL PROTECTED] For additional commands, e-mail: mailto:[EMAIL PROTECTED]
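For the situation above — a desktop app running without root — the per-process limits can at least be inspected (and the soft limit raised up to the hard ceiling) from an ordinary user account. The /proc paths are Linux-specific and vary across the systems discussed in this thread:

```shell
# Inspect file-descriptor limits without root access.
ulimit -n                    # soft per-process limit for this shell
ulimit -Hn                   # hard limit: the ceiling a user may raise to
# Count descriptors currently open in this shell (Linux only).
[ -d /proc/self/fd ] && ls /proc/self/fd | wc -l || true
```

A non-root user can run `ulimit -n <N>` for any N up to the hard limit, e.g. from the app's launch script, which avoids touching system-wide settings like /proc/sys/fs/file-max.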