Re: Multiple indexes
Ben, You do need to use a separate instance of those 3 classes for each index, yes. But this is really something like: IndexWriter writer = new IndexWriter(); So it's the normal code-writing process; you don't really have to create anything new, just use the existing Lucene API. As for locking, again you don't need to create anything. Lucene does have a locking mechanism, but most of it should be completely invisible to you if you follow the concurrency rules. I hope this helps. Otis --- Ben [EMAIL PROTECTED] wrote: Is it true that for each index I have to create a separate instance of FSDirectory, IndexWriter and IndexReader? Do I need to create a separate locking mechanism as well? I have already implemented a program using just one index. Thanks, Ben On Tue, 1 Mar 2005 22:09:05 -0500, Erik Hatcher [EMAIL PROTECTED] wrote: It's hard to answer such a general question with anything very precise, so sorry if this doesn't hit the mark. Come back with more details and we'll gladly assist though. First, certainly do not copy/paste code. Use standard reuse practices: perhaps the same program can build the two different indexes if passed different parameters, or share code between two different programs as a JAR. What specifically are the issues you're encountering? Erik On Mar 1, 2005, at 8:06 PM, Ben wrote: Hi My site has two types of documents with different structure. I would like to create an index for each type of document. What is the best way to implement this? I have been trying to implement this but found out that 90% of the code is the same. In the Lucene in Action book there is a case study on jGuru; it just mentions them using multiple indexes. I would like to do something like them. Any resources on the Internet that I can learn from? Thanks, Ben
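For illustration, a minimal sketch of what Otis describes, using the Lucene 1.4 API discussed in this thread; the paths and field names are hypothetical, and the point is simply that each index gets its own IndexWriter while the surrounding code stays shared:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class TwoIndexes {
    public static void main(String[] args) throws Exception {
        // One IndexWriter per index; the paths here are hypothetical.
        IndexWriter writerA = new IndexWriter("/tmp/index-a", new StandardAnalyzer(), true);
        IndexWriter writerB = new IndexWriter("/tmp/index-b", new StandardAnalyzer(), true);

        Document docA = new Document();               // first document type
        docA.add(Field.Text("title", "A type-A document"));
        writerA.addDocument(docA);

        Document docB = new Document();               // second document type,
        docB.add(Field.Text("name", "A type-B document")); // different fields
        writerB.addDocument(docB);

        writerA.close();                              // each index is closed
        writerB.close();                              // independently
    }
}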
Re: Ranking Terms
Make sure you are not indexing your documents using the compound index format (the default in newer versions of Lucene). Then you will see the .frq file. Here is an example from one of Simpy's Lucene indices: -rw-r--r-- 1 simpy simpy 629073 Feb 26 13:14 _1ao.frq Otis -- http://www.simpy.com --- Daniel Cortes [EMAIL PROTECTED] wrote: Hi everybody, I need to find some documentation about the algorithms that Lucene uses internally for indexing and how it works with weights and frequencies of the terms. This information will be used to learn the tastes of my users and to relate users with the same interests. :D I read something about .frq files but I don't have any .frq file in my index. Thanks.
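For reference, a minimal sketch (assuming the Lucene 1.4 IndexWriter API) of turning off the compound format so that the individual files such as .frq appear on disk; the path is hypothetical:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class MultiFileFormat {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("/tmp/index", new StandardAnalyzer(), true);
        // false = write separate .frq, .prx, .tis, ... files
        // instead of packing everything into one .cfs file
        writer.setUseCompoundFile(false);
        // ... addDocument() calls go here ...
        writer.close();
    }
}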
Re: Lucene vs. in-DB-full-text-searching
The most obvious answer is that the full-text indexing features of RDBMSes are not as good (as fast) as Lucene. MySQL, PostgreSQL, Oracle, MS SQL Server, etc. all have full-text indexing/searching features, but I always hear people complaining about the speed. A person from a well-known online bookseller told me recently that Lucene was about 10x faster than MySQL for full-text searching, and I am currently helping someone move away from MySQL to Lucene for performance reasons. Otis --- Steven J. Owens [EMAIL PROTECTED] wrote: Hi, I was rambling to some friends about an idea to build a cache-aware JDBC driver wrapper, to make it easier to keep a Lucene index of a database up to date. They asked me a question that I have to take seriously, which is that most RDBMSes provide some built-in fulltext searching - postgres, mysql, even oracle - why not use that instead of adding another layer of caching? I have to take this question seriously, especially since it reminds me a lot of what Doug has often said to folks contemplating doing similar things (caching query results, etc.) with Lucene. Has anybody done some serious investigation into this, and could summarize the pros and cons? -- Steven J. Owens [EMAIL PROTECTED] I'm going to make broad, sweeping generalizations and strong, declarative statements, because otherwise I'll be here all night and this document will be four times longer and much less fun to read. Take it all with a grain of salt. - http://darksleep.com/notablog
Re: Search Performance
Or you could just open a new IndexSearcher, forget the old one, and have GC collect it when everyone is done with it. Otis --- Chris Lamprecht [EMAIL PROTECTED] wrote: I should have mentioned, the reason for not doing this the obvious, simple way (just close the Searcher and reopen it if a new version is available) is because some threads could be in the middle of iterating through the search Hits. If you close the Searcher, they get a "Bad file descriptor" IOException. As I found out the hard way :) On Fri, 18 Feb 2005 15:03:29 -0600, Chris Lamprecht [EMAIL PROTECTED] wrote: I recently dealt with the issue of re-using a Searcher with an index that changes often. I wrote a class that allows my searching classes to check out a Lucene Searcher, perform a search, and then return the Searcher. It's similar to a database connection pool, except that ...
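A minimal sketch of the swap-and-forget approach Otis suggests, under the assumption that each searching thread keeps whatever Searcher reference it started with; the class name and path are hypothetical:

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Searcher;

public class SearcherHolder {
    private volatile Searcher current;

    public SearcherHolder(String indexPath) throws Exception {
        current = new IndexSearcher(indexPath);
    }

    // Callers grab the current searcher; a thread already iterating Hits
    // keeps its old reference, so it is never closed out from under it.
    public Searcher getSearcher() {
        return current;
    }

    // Called after the index has changed: swap in a fresh searcher and
    // simply drop the old one. GC reclaims it once the last in-flight
    // search holding a reference finishes.
    public synchronized void reopen(String indexPath) throws Exception {
        current = new IndexSearcher(indexPath);
    }
}

As the follow-up message below notes, the dropped searcher keeps its files open until it is collected, so an explicit close() once it is safe works too.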
Re: Document comparison
Matt, Erik and I have some code for this in Lucene in Action, but David Spencer did this since the book was published: http://www.lucenebook.com/blog/announcements/more_like_this.html Otis --- Matt Chaput [EMAIL PROTECTED] wrote: Is there a simple, efficient way to compute similarity of documents indexed with Lucene? My first, naive idea is to use the entire contents of one document as a query to the second document, and use the score as a similarity measurement. But I think I'm probably way off base with that. Can any IR pros set me straight? Thanks very much. Matt -- Matt Chaput Word Monkey Side Effects Software Inc. A goddamned ray of sunshine all the goddamned time -- Sparkle Hayter - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Search Performance
Yes, until it's cleaned up - as soon as the last client is done with Hits, the originating IndexSearcher is ready for cleanup, if nobody else is holding references to it. You can close it explicitly, as you are doing, too; no harm. Otis --- Chris Lamprecht [EMAIL PROTECTED] wrote: Wouldn't this leave open file handles? I had a problem where there were lots of open file handles for deleted index files, because the old searchers were not being closed. On Fri, 18 Feb 2005 13:41:37 -0800 (PST), Otis Gospodnetic [EMAIL PROTECTED] wrote: Or you could just open a new IndexSearcher, forget the old one, and have GC collect it when everyone is done with it. Otis
Re: Concurrent searching & re-indexing
Hi Paul, If I understand your setup correctly, it looks like you are running multiple threads that create an IndexWriter for the same directory. That's a no-no. This section (first hit) describes the various concurrency issues with regard to adds, updates, optimization, and searches: http://www.lucenebook.com/search?query=concurrent IndexSearcher (IndexReader, really) does take a snapshot of the index state when it is opened, so at that time the index segments listed in the segments file should be in a complete state. It also reads index files while searching, of course. Otis --- Paul Mellor [EMAIL PROTECTED] wrote: Hi, I've read from various sources on the Internet that it is perfectly safe to simultaneously search a Lucene index that is being updated from another thread, as long as all write access to the index is synchronized. But does this apply only to updating the index (i.e. deleting and adding documents), or also to a complete re-indexing (i.e. creating a new IndexWriter with the 'create' argument true and then re-adding all the documents)? I have a class which encapsulates all access to my index, so that writes can be synchronized. This class also exposes a method to obtain an IndexSearcher for the index. I'm running unit tests which create many threads - each thread does a complete re-indexing and then obtains an IndexSearcher and does a query. I'm finding that with sufficiently high numbers of threads I get the occasional failure, with the following exception thrown when attempting to construct a new IndexWriter (during the reindexing): java.io.IOException: couldn't delete _a.f1 at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:166) at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:135) at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:113) at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:151) ... The exception occurs quite infrequently (usually for somewhere between 1-5% of the threads). Does the IndexSearcher take a 'snapshot' of the index at creation? Or does it access the filesystem whilst searching? I am also synchronizing creation of the IndexSearcher with the write lock, so that the IndexSearcher is not created whilst the index is being recreated (and vice versa). But do I need to ensure that the IndexSearcher cannot search whilst the index is being recreated as well? Note that a similar unit test where the threads update the index (rather than recreate it from scratch) works fine, as expected. This is running on Windows 2000. Any help would be much appreciated! Paul
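A sketch of the rule Otis is pointing at - funnel all writes through one lock so no two threads ever have an IndexWriter open on the same directory at once; the class is hypothetical and the API is Lucene 1.4:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

public class IndexManager {
    private final Object writeLock = new Object();
    private final String indexPath;

    public IndexManager(String indexPath) {
        this.indexPath = indexPath;
    }

    // All (re)indexing goes through this one method, so two threads can
    // never open an IndexWriter on the same directory concurrently.
    public void rebuild(Document[] docs) throws Exception {
        synchronized (writeLock) {
            IndexWriter writer = new IndexWriter(indexPath, new StandardAnalyzer(), true);
            for (int i = 0; i < docs.length; i++) {
                writer.addDocument(docs[i]);
            }
            writer.optimize();
            writer.close();
        }
    }
}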
Re: What does [] do to a query and what's up with lucene.apache.org?
Hi, lucene.apache.org seems to work now. Here is the query syntax: http://lucene.apache.org/queryparsersyntax.html [] is used as [BEGIN-RANGE-STRING TO END-RANGE-STRING] Otis --- Jim Lynch [EMAIL PROTECTED] wrote: First I'm getting a "The requested URL could not be retrieved" while trying to retrieve the URL: http://lucene.apache.org/src/test/org/apache/lucene/queryParser/TestQueryParser.java The following error was encountered: Unable to determine IP address from host name for lucene.apache.org. Guess the system is down. I'm also getting this error: org.apache.lucene.queryParser.ParseException: Encountered "is" at line 1, column 15. Was expecting: "]" ... when I tried to parse the following string: [this is a test]. I can't find any documentation that tells me what the brackets do to a query. I had a user that was used to another search engine that used [] to do proximity or "near" searches and tried it on this one. Actually I'd like to see the documentation for what the parser does. All that is mentioned in the javadoc is + - and (). Obviously there are more special characters. Thanks, Jim.
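A small example of the bracket syntax in use, assuming the Lucene 1.4 QueryParser; the field name and values are illustrative:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class RangeSyntax {
    public static void main(String[] args) throws Exception {
        // [x TO y] is an inclusive range, {x TO y} an exclusive one;
        // "date" and the values here are just for illustration.
        Query q = QueryParser.parse("date:[20050101 TO 20050301]",
                                    "contents", new StandardAnalyzer());
        System.out.println(q); // a RangeQuery on the date field
    }
}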
Re: behavioral differences between Field.Keyword and Field.UnStored
The QueryParser is analyzing your Field.Keyword (genre) field, because it doesn't know that genre is a Keyword field and should not be analyzed. Check section 4.4 here: http://www.lucenebook.com/search?query=queryparser+keyword Otis --- Mike Rose [EMAIL PROTECTED] wrote: Perhaps someone can explain something that seems a little weird to me. I seem to be unable to search on fields of type Keyword. The following snippet returns no hits: IndexWriter index = new IndexWriter(indexPath, new StandardAnalyzer(), true); Document doc = new Document(); doc.add(Field.Text("artist", "Butthole Surfers")); doc.add(Field.Keyword("genre", "Punk")); doc.add(Field.Text("album", "Rembrandt Pussyhorse")); index.addDocument(doc); doc = new Document(); doc.add(Field.Text("artist", "Ornette Coleman")); doc.add(Field.Keyword("genre", "Jazz")); doc.add(Field.Text("album", "Tomorrow is the Question")); index.addDocument(doc); index.optimize(); index.close(); Searcher searcher = new IndexSearcher(indexPath); String expression = "genre:punk"; Query query = QueryParser.parse(expression, "artist", new StandardAnalyzer()); Hits hits = searcher.search(query); for (int i = 0; i < hits.length(); i++) { System.out.println(hits.doc(i)); } searcher.close(); However, if I change the genre field to be defined as Field.Text or Field.UnStored, I get the result I expect. Can anyone offer any insight? Mike
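The usual workaround, per the book section Otis cites: query an un-analyzed Keyword field with a hand-built TermQuery, bypassing QueryParser's analysis. Note the term text must match the indexed value exactly, including case. A sketch with a hypothetical index path:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class KeywordSearch {
    public static void main(String[] args) throws Exception {
        IndexSearcher searcher = new IndexSearcher("/tmp/index");
        // Field.Keyword values are indexed verbatim, so the term must
        // match exactly -- "Punk" with a capital P, not "punk".
        Hits hits = searcher.search(new TermQuery(new Term("genre", "Punk")));
        System.out.println(hits.length() + " hit(s)");
        searcher.close();
    }
}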
Re: Optimize not deleting all files
Get and try Lucene 1.4.3. One of the older versions had a bug that left old index files undeleted. Otis --- [EMAIL PROTECTED] wrote: Hi, When I run an optimize in our production environment, old index files are left in the directory and are not deleted. My understanding is that an optimize will create new index files and all pre-existing index files should be deleted. Is this correct? We are running Lucene 1.4.2 on Windows. Any help is appreciated. Thanks!
Re: Numbers in the Query String
Using different analyzers for indexing and searching is not recommended. Your numbers are not even in the index because you are using StandardAnalyzer. Use Luke to look at your index. Otis --- Hetan Shah [EMAIL PROTECTED] wrote: Hello, How can one search for a document based on a query which has numbers in the query string? e.g. query = "Java 2 Platform J2EE". What do I need to do so that the numbers do not get neglected? I am using StandardAnalyzer to index the pages and StopAnalyzer to search the documents. Would the use of two different analyzers cause any trouble for the results? Thanks. -H
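A sketch of the symmetric-analyzer fix; WhitespaceAnalyzer is just an illustrative choice (it visibly keeps tokens such as "2" and "J2EE"), and the key point is using the same analyzer on both sides:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class SameAnalyzer {
    public static void main(String[] args) throws Exception {
        // One analyzer for both indexing and searching. The symmetry is
        // what matters, not this particular analyzer.
        Analyzer analyzer = new WhitespaceAnalyzer();
        // ... index with: new IndexWriter(path, analyzer, true) ...
        Query q = QueryParser.parse("Java 2 Platform J2EE", "contents", analyzer);
        System.out.println(q); // "2" and "J2EE" survive as query terms
    }
}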
Re: which HTML parser is better?
If you are not married to Java: http://search.cpan.org/~kilinrax/HTML-Strip-1.04/Strip.pm Otis --- sergiu gordea [EMAIL PROTECTED] wrote: Karl Koch wrote: I am in control of the HTML, which means it is well-formatted HTML. I use only HTML files which I have transformed from XML. No external HTML (e.g. the web). Are there any very short solutions for that? If you are using only correctly formatted HTML pages and you are in control of these pages, you can use a regular expression to remove the tags, something like replaceAll("<[^>]*>", ""); That is the idea behind the operation. If you search on Google you will find a more robust regular expression. Using a simple regular expression will be a very cheap solution that can cause you a lot of problems in the future. It's up to you to use it. Best, Sergiu Karl Koch wrote: Hi, yes, but the library you are using is quite big. I was thinking that 5kB of code could actually do that. That sourceforge project is doing much more than that, but I do not need it. You need just the htmlparser.jar, 200k. ... you know ... the functionality is strongly correlated with the size. You can use 3 lines of code with a good regular expression to eliminate the HTML tags, but this won't give you any guarantee that the text from badly formatted HTML files will be correctly extracted... Best, Sergiu Hi Karl, I already submitted a piece of code that removes the HTML tags. Search for my previous answer in this thread. Best, Sergiu Karl Koch wrote: Hello, I have been following this thread and have another question. Is there a piece of source code (which is preferably very short and simple (KISS)) which allows one to remove all HTML tags from HTML content? HTML 3.2 would be enough... also no frames, CSS, etc. I do not need the HTML structure tree or any other structure, but need a facility to clean up HTML into its normal underlying content before indexing that content as a whole. Karl I think that depends on what you want to do. The Lucene demo parser does simple mapping of HTML files into Lucene Documents; it does not give you a parse tree for the HTML doc. CyberNeko is an extension of Xerces (uses the same API; will likely become part of Xerces), and so maps an HTML document into a full DOM that you can manipulate easily for a wide range of purposes. I haven't used JTidy at an API level and so don't know it as well -- based on its UI, it appears to be focused primarily on HTML validation and error detection/correction. I use CyberNeko for a range of operations on HTML documents that go beyond indexing them in Lucene, and really like it. It has been robust for me so far. Chuck -----Original Message----- From: Jingkang Zhang [mailto:[EMAIL PROTECTED] Sent: Tuesday, February 01, 2005 1:15 AM To: lucene-user@jakarta.apache.org Subject: which HTML parser is better? Three HTML parsers (Lucene web application demo, CyberNeko HTML Parser, JTidy) are mentioned in Lucene FAQ 1.3.27. Which is the best? Can it filter tags that are auto-created by MS Word's 'Save As HTML' function?
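For the record, the kind of quick-and-dirty tag stripping Sergiu describes might look like this in plain Java - with his own caveat that it is only safe for well-formed HTML you control:

public class TagStripper {
    // Removes anything between angle brackets; good enough for clean,
    // well-formed HTML, but will mangle pathological input (e.g. a
    // literal '<' inside an attribute or a script block).
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<html><body><p>Hello <b>world</b></p></body></html>"));
        // prints: Hello world
    }
}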
RE: carrot2 question too - Re: Fun with the Wikipedia
Adam, Dawid posted some code that lets you use Carrot2 locally with Lucene, without the componentized pipeline system described on the Carrot2 site. Otis --- Adam Saltiel [EMAIL PROTECTED] wrote: Dawid, Hi, Would you be able to comment on the coincidentally recent thread "RE: - Grouping Search Results by Clustering Snippets"? Also, when I looked at Carrot2, the pipeline is implemented over HTTP. I wonder how efficient that is, or can it be changed, for instance to an all-local implementation? Has Carrot2 been integrated with Lucene, and has it been used as the basis for a recommender system (could it be?)? TIA. Adam -----Original Message----- From: Dawid Weiss [mailto:[EMAIL PROTECTED] Sent: Monday, January 31, 2005 4:12 PM To: Lucene Users List Subject: Re: carrot2 question too - Re: Fun with the Wikipedia Hi. Coming up with answers... a little belated, but hope you're still on: we have been experimenting with carrot2 and are very pleased so far, only one issue: there is no release, not even an alpha one, and the dependencies seem to be patched (jama) Yes, there is no official release. We just don't feel the need to tag the sources with an official label because Carrot is not a stand-alone product (rather a library... or a framework). It does not imply that the project is in alpha stage... quite the contrary, in fact -- it has been out there for a while and it seems to do a good job for most people. is there any intention to have any releases in the near future? I could tag a release even today if it makes you happy ;) But I hope I made the status of the project clear above. D.
Re: total number of (unique) terms in the index
I don't think there is a direct way to get the number of unique terms in the index, so yes, I think you'll have to loop through a TermEnum and count. Otis --- Jonathan Lasko [EMAIL PROTECTED] wrote: I'm looking for the total number of unique terms in the index. I see that I can get a TermEnum of all the terms in the index, but what is the fastest way to get the total number of terms? Jonathan
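A sketch of that loop, assuming the Lucene 1.4 IndexReader/TermEnum API and a hypothetical index path:

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermEnum;

public class CountTerms {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("/tmp/index");
        TermEnum terms = reader.terms(); // enumerates every unique term
        int count = 0;
        while (terms.next()) {
            count++;
        }
        terms.close();
        reader.close();
        System.out.println(count + " unique terms");
    }
}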
Re: Loading a large index
Edwin, --- Edwin Tang [EMAIL PROTECTED] wrote: I have three indices, really, that I search via ParallelMultiSearcher. All three are being updated constantly. We would like to be able to perform a search on the indices and have the results reflect the latest documents indexed. However, that would mean I need to refresh my searcher. Because of the size of these, it's taking some time to load, and so search speed from the end-user perspective seems slow. What can I do to minimize or do away with the time it takes to load a new searcher... from the end-user perspective, that is? How up-to-date do these searches have to be? If they don't have to be exactly up to date, you could periodically re-create the IndexSearcher, instead of checking for a new index version on every search. I think a person from Moreover.com posted some code that may be relevant. Maybe 3-4 months ago, maybe 6... it had to do with re-reading the index in the background for sorting purposes, if I recall correctly. Otis
Re: Disk space used by optimize
Morus, that description of 3 sets of index files is what I was imagining, too. I'll have to test and add to the book errata, it seems. Thanks for the info, Otis --- Morus Walter [EMAIL PROTECTED] wrote: Otis Gospodnetic writes: Hello, Yes, that is how optimize works - it copies all existing index segments into one unified index segment, thus optimizing it. See hit #1: http://www.lucenebook.com/search?query=optimize+disk+space However, three times the space sounds a bit too much, or I made a mistake in the book. :) I cannot explain why, but ~ three times the size of the final index is what I observed when I logged disk usage during optimize of an index in the compound index format. The test was on Linux; I simply did a 'du -s' every few seconds in parallel to the optimize. I didn't test the non-compound format. Probably optimizing a compound-format index requires storing the different parts of the compound file separately before joining them into the compound file (sounds reasonable; otherwise you would need to know the sizes before creating the parts). In that case you have the original index, the separate files, and the new compound file at the disk-usage peak. So IMHO the book is wrong. Morus
Re: Lucene in Action hits desk in UK
Hello, I've asked the publisher ( http://www.manning.com ) yesterday. I don't know about the exact stores, but apparently they do have a distributor in Singapore, so you should be able to find Lucene in Action there soon. Otis --- jac jac [EMAIL PROTECTED] wrote: Just wondering: is Lucene in Action being sold anywhere in Singapore? thanks! Otis Gospodnetic [EMAIL PROTECTED] wrote: Gospodnetić sounds like Gospodnetich, and Eric is Erik :) Otis --- John Haxby wrote: Otis Gospodnetic wrote: I contacted both the US and UK Amazon sites and asked them to fix my last name (the last character in my name has a little slash (not an accent) above it), but they never bothered to fix it nor email me back. They probably don't know how to type a ć. How _do_ you pronounce your name? I've no idea what to do with that mark over the final c! At the moment it's Lucene in Action by Eric Hatcher and Otis Gospo-something-or-other :-) Anyhow, enjoy Lucene in Action! Already doing so! jch
Re: Different Documents (with fields) in one index?
Karl, This is completely fine. You can have documents with different fields in the same index. Otis --- Karl Koch [EMAIL PROTECTED] wrote: Hello all, perhaps not such a sophisticated question: I would like to have a very diverse set of documents in one index. Depending on the contents of the text documents, I would like to put parts of the text in different fields. This means that in searches, when searching a particular field, some of those documents won't be addressed at all. Is it possible to have different kinds of Documents with different index fields in ONE index? Or do I need one index for each set? Karl
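A tiny illustration with made-up fields: two entirely different document shapes going into one index, using the Lucene 1.4 API:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class MixedIndex {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("/tmp/index", new StandardAnalyzer(), true);

        Document article = new Document();           // one kind of document
        article.add(Field.Text("headline", "Lucene 1.4.3 released"));
        article.add(Field.Text("body", "Some article text..."));
        writer.addDocument(article);

        Document product = new Document();           // a different kind, with
        product.add(Field.Text("name", "Widget"));   // entirely different fields
        product.add(Field.Keyword("sku", "W-1000"));
        writer.addDocument(product);

        writer.close();
        // A search on "headline" simply never matches the product document.
    }
}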
Re: Boosting Questions
Luke, Boosting is only one of the factors involved in Document/Query scoring. Assuming that applying your boost to Document A or a single field of Document A increases the total score enough, yes, Document A may have the highest score. But just because you boost a single Document and not others, it does not mean it will emerge at the top. You should check out the Explanation class, which can dump all scoring factors in text or HTML format. Otis --- Luke Shannon [EMAIL PROTECTED] wrote: Hi All; I just want to make sure I have the right idea about boosting. So if I boost a document (Document A) after I index it (let's say with a factor of 2.0), Lucene will now consider this document relatively more important than other documents in the index with a boost factor less than 2.0. This boost factor will also be applied to all the fields in Document A. Therefore, if I do a TermQuery on a field that all my documents share (title), in the returned Hits (assuming Document A was among the returned documents), Document A will score higher than other documents with a lower boost factor because the title field in A would have been boosted along with all its other fields. Correct? Now if at indexing time I decided to boost a particular field, let's say address in Document A (a field which all documents have), the boost factor is only applied to the address field of Document A. Nothing else is boosted by this operation. This means if a TermQuery on the address field returns Document A along with a collection of other documents, Document A will score higher than the others because of the boosting. Correct? Thanks, Luke
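A sketch of both kinds of boost plus the Explanation dump Otis mentions, using the Lucene 1.4 API; values, paths, and field names are illustrative:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class BoostDemo {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("/tmp/index", new StandardAnalyzer(), true);

        Document a = new Document();
        a.setBoost(2.0f);                          // document-level boost:
        a.add(Field.Text("title", "lucene"));      // folded into every field of A
        writer.addDocument(a);

        Document b = new Document();
        Field title = Field.Text("title", "lucene");
        title.setBoost(0.5f);                      // field-level boost: affects
        b.add(title);                              // only this one field of B
        writer.addDocument(b);
        writer.close();

        IndexSearcher searcher = new IndexSearcher("/tmp/index");
        TermQuery q = new TermQuery(new Term("title", "lucene"));
        // Explanation shows every scoring factor (tf, idf, boost, norms).
        System.out.println(searcher.explain(q, 0).toString());
        searcher.close();
    }
}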
Re: XML index
Hello Karl, Grab the source code for Lucene in Action; it's got code that parses and indexes XML with DOM and SAX. You can see the coverage of that stuff here: http://lucenebook.com/search?query=indexing+XML+section%3A7* I haven't used kXML, but I imagine the LIA code should get you going quickly, and you are free to adapt it to work with kXML. Otis --- Karl Koch [EMAIL PROTECTED] wrote: Hi, I want to use kXML with Lucene to index XML files. I think it is possible to dynamically assign node names as Document fields and node texts as Text (after using an Analyzer). I have seen some XML indexing in the Sandbox. Is anybody here who has done something with a thin pull parser (perhaps even kXML)? Does anybody know of a project or some source code available which covers this topic? Karl
Re: Disk space used by optimize
Hello, Yes, that is how optimize works - it copies all existing index segments into one unified index segment, thus optimizing it. See hit #1: http://www.lucenebook.com/search?query=optimize+disk+space However, three times the space sounds a bit too much, or I made a mistake in the book. :) You said you end up with 3 files - .cfs is one of them, right? Otis --- Kauler, Leto S [EMAIL PROTECTED] wrote: Just a quick question: after writing an index and then calling optimize(), is it normal for the index to expand to about three times the size before finally compressing? In our case the optimise grinds the disk, expanding the index into many files of about 145MB total, before compressing down to three files of about 47MB total. That must be a lot of disk activity for the people with multi-gigabyte indexes! Regards, Leto
Re: rackmount lucene/nutch - Re: google mini? who needs it when Lucene is there
I discuss this with myself a lot inside my head... :) Seriously, I agree with Erik. I think this is a business opportunity. How many people are hating me now and going "shhh"? Raise your hands! Otis --- David Spencer [EMAIL PROTECTED] wrote: This reminds me, has anyone ever discussed something similar: - rackmount server (or, for the coolness factor, that Mac mini) - web interface for config/control - of course the server would have the following s/w: -- web server -- lucene / nutch Part of the work here, I think, is having a decent web interface to configure the thing and to customize the look and feel of the search results. jian chen wrote: Hi, I was searching using google and just found that there was a new feature called google mini. Initially I thought it was another free service for small companies. Then I realized that it costs quite some money ($4,995) for the hardware and software. (I guess the proprietary software costs a whole lot more than the actual hardware.) The nice feature is that you can only index up to 50,000 documents at this price. If you need to index more, sorry, send in the check... It seems to me that any small biz will be ripped off if they install this google mini thing, compared to using Lucene to implement an easy-to-use search tool, which could search up to whatever number of documents you could imagine. I hope the Lucene project gets more exposure in the enterprise so that people know they have not only cheaper but, more importantly, BETTER alternatives. Jian
RE: Disk space used by optimize
Have you tried using the multifile index format? Now I wonder if there is actually a difference in the disk space consumed by optimize() between the multifile and compound index formats... Otis --- Kauler, Leto S [EMAIL PROTECTED] wrote: Our copy of LIA is in the mail ;) Yes, the final three files are: the .cfs (46.8MB), deletable (4 bytes), and segments (29 bytes). --Leto -----Original Message----- From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] Hello, Yes, that is how optimize works - it copies all existing index segments into one unified index segment, thus optimizing it. See hit #1: http://www.lucenebook.com/search?query=optimize+disk+space However, three times the space sounds a bit too much, or I made a mistake in the book. :) You said you end up with 3 files - .cfs is one of them, right? Otis --- Kauler, Leto S [EMAIL PROTECTED] wrote: Just a quick question: after writing an index and then calling optimize(), is it normal for the index to expand to about three times the size before finally compressing? In our case the optimise grinds the disk, expanding the index into many files of about 145MB total, before compressing down to three files of about 47MB total. That must be a lot of disk activity for the people with multi-gigabyte indexes! Regards, Leto
Re: google mini? who needs it when Lucene is there
500 times the original data? Not true! :) Otis --- Xiaohong Yang (Sharon) [EMAIL PROTECTED] wrote: Hi, I agree that the Google Mini is quite expensive. It might be similar to the desktop version in quality. Does anyone know Google's ratio of index to text? Is it true that Lucene's index is about 500 times the original text size (not including image size)? I don't have one installed, so I cannot measure. Best, Sharon jian chen [EMAIL PROTECTED] wrote: Hi, I was searching using google and just found that there was a new feature called google mini. Initially I thought it was another free service for small companies. Then I realized that it costs quite some money ($4,995) for the hardware and software. (I guess the proprietary software costs a whole lot more than the actual hardware.) The nice feature is that you can only index up to 50,000 documents at this price. If you need to index more, sorry, send in the check... It seems to me that any small biz will be ripped off if they install this google mini thing, compared to using Lucene to implement an easy-to-use search tool, which could search up to whatever number of documents you could imagine. I hope the Lucene project gets more exposure in the enterprise so that people know they have not only cheaper but, more importantly, BETTER alternatives. Jian
Re: Lucene in Action hits desk in UK
The publisher-to-Amazon information feed seems to be a fairly manual process, and Amazon takes a while to update book information on their site, including prices. I contacted both the US and UK Amazon sites and asked them to fix my last name (the last character in my name has a little slash (not an accent) above it), but they never bothered to fix it nor email me back. Anyhow, enjoy Lucene in Action! Otis Gospodnetić --- John Haxby [EMAIL PROTECTED] wrote: My copy of Lucene in Action has finally hit my desk in the UK. Hopefully the dispatch time quoted by amazon.co.uk will now start to drop to something more sensible. It's been interesting watching the price changes. When I ordered my copy back in November, I paid £19.38 for it. At around the time of publication, the price went up to £35.99, the list price. It's currently priced at £25.19, 30% off list price. jch
Re: Getting Into Search
Hi Luke, That's not hard with RangeQuery (supported by QueryParser); take a look at this: http://www.lucenebook.com/search?query=date+range The grayed-out text has the section name and page number, so you can quickly locate this stuff in your ebook. Otis P.S. Do you know if Indigo/Chapters has Lucene in Action on their bookshelves yet? --- Luke Shannon [EMAIL PROTECTED] wrote: Hello; My Lucene application has been performing well in our company's CMS application. The plan now is to offer advanced searching. I just bought the ebook version of Lucene in Action to help with my research (it is taking Amazon forever to ship the printed version to Canada). The book looks great and will certainly deepen my understanding, but I am suffering a bit of information overload. I was hoping I could post the rough requirements I was given this morning and perhaps some more experienced Lucene users could help direct my research (this can even be pointing me to relevant sections of the book). 1. Documents in the system contain the fields ModificationDate and CreationDate. A query is required that allows users to search for documents created/modified on a certain date or within a certain date range. 2. Documents in the system also contain the fields Title and Path. A query is required that allows users to search for Titles or Paths starting with, ending with, containing (this is all the system currently does), or matching specific term(s). Later today I will get more specific requirements. For now I am looking through the Analysis section of the ebook for ideas on how to handle this. Any tips anyone can give would be appreciated. Thanks, Luke
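For requirement 1, a hedged sketch of a date-range search, assuming the dates were indexed as sortable YYYYMMDD keyword strings (one common convention; not necessarily what Luke's CMS does):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.RangeQuery;

public class DateRangeSearch {
    public static void main(String[] args) throws Exception {
        IndexSearcher searcher = new IndexSearcher("/tmp/index"); // hypothetical path
        // true = the endpoints are inclusive
        RangeQuery q = new RangeQuery(new Term("ModificationDate", "20050101"),
                                      new Term("ModificationDate", "20050131"),
                                      true);
        Hits hits = searcher.search(q);
        System.out.println(hits.length() + " document(s) modified in January 2005");
        searcher.close();
    }
}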
Re: Lucene in Action hits desk in UK
Gospodnetić sounds like Gospodnetich, and Eric is Erik :) Otis --- John Haxby [EMAIL PROTECTED] wrote: Otis Gospodnetic wrote: I contacted both the US and UK Amazon sites and asked them to fix my last name (the last character in my name has a little slash (not an accent) above it), but they never bothered to fix it nor email me back. They probably don't know how to type a ć. How _do_ you pronounce your name? I've no idea what to do with that mark over the final c! At the moment it's Lucene in Action by Eric Hatcher and Otis Gospo-something-or-other :-) Anyhow, enjoy Lucene in Action! Already doing so! jch
Re: Search on heterogenous index
Hello Simeon, Heterogeneous Documents/indices are OK - check out the second hit: http://www.lucenebook.com/search?query=heterogenous+different Otis --- Simeon Koptelov [EMAIL PROTECTED] wrote: Hello all. I'm new to Lucene and am thinking about using it in my project. I have price lists with dynamic structure containing wares, about 10K price lists with 500K wares in total. Each price list has about 5 text fields. I'll do searches on wares. The difficult part is that I'll search across all wares; the search is not bound to a particular price list structure. My question is, how should I organize my indices? Can Lucene handle the data effectively if I have one index containing different Fields across Documents? Or should I create a separate index for each price list with the same Field structure across Documents?
Re: Search Chinese in Unicode !!!
I don't have a document with Chinese characters to verify this, but it looks right, so I'll add your change to SearchFiles.java. Thanks, Otis --- Eric Chow [EMAIL PROTECTED] wrote: Search is not really correct with UTF-8!!! The following is the search result when I used the SearchFiles from the Lucene demo. d:\Downloads\Softwares\Apache\Lucene\lucene-1.4.3\src> java org.apache.lucene.demo.SearchFiles c:\temp\myindex Usage: java SearchFiles index Query: 經 Searching for: g strange ?? 3 total matching documents 0. ../docs/ChineseDemo.html - this file contains the 經 1. ../docs/luceneplan.html - Jakarta Lucene - Plan for enhancements to Lucene 2. ../docs/api/index-all.html - Index (Lucene 1.4.3 API) Query: From the above results, only ChineseDemo.html includes the character that I want to search! The modified code in SearchFiles.java: BufferedReader in = new BufferedReader(new InputStreamReader(System.in, "UTF-8"));
Re: English and French documents together / analysis, indexing, searching
That would be a partial solution. Accents will not be a problem any more, but if you use an Analyzer that stems tokens, they will not really be tokenized properly. Searches will probably work, but if you look at the index you will see that some terms were not analyzed properly. It may be sufficient for your needs, though, so try it with just accent removal first. Otis --- [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Morus Walter said the following on 1/21/2005 2:14 AM: No. You could do a ( ( french-query ) OR ( english-query ) ) construct using one query. So query construction would be a bit more complex, but querying itself wouldn't change. The first thing I'd do in your case would be to look at the differences in the output of the English and French snowball stemmers. I don't speak any French, but you might even be able to use both stemmers on all texts. Morus I've done some thinking afterwards, and instead of messing with complex queries, would it make sense to replace all special characters such as é and è with e during indexing (I suppose by writing a custom analyzer) and then, during searching, parse the query and replace all occurrences of special characters (if any) with their plain Latin equivalents? This should produce the required results, no? Since the index would not contain any French characters, searching for French words would still return them, since they were indexed as plain words. -pedja
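A sketch of what such a custom filter could look like against the Lucene 1.4 token API; the character table is deliberately minimal and the class name is made up:

import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class AccentFilter extends TokenFilter {
    public AccentFilter(TokenStream in) {
        super(in);
    }

    // Map common French accented characters to their plain Latin
    // equivalents; extend the table as needed.
    public Token next() throws IOException {
        Token t = input.next();
        if (t == null) return null;
        String s = t.termText()
            .replace('é', 'e').replace('è', 'e').replace('ê', 'e').replace('ë', 'e')
            .replace('à', 'a').replace('â', 'a').replace('î', 'i').replace('ï', 'i')
            .replace('ô', 'o').replace('ù', 'u').replace('û', 'u').replace('ç', 'c');
        return new Token(s, t.startOffset(), t.endOffset(), t.type());
    }
}

Wrap it in an Analyzer's tokenStream() and, as discussed above, use that same analyzer for both indexing and query parsing so both sides see accent-free terms.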
Re: keep indexes as files or save them in database
A number of people have tried putting Lucene indices in an RDBMS. As far as I know, all were slower than FSDirectory. Otis --- nafise hassani [EMAIL PROTECTED] wrote: Hi, I want to know, from the performance point of view, whether it is better to save Lucene indexes in a database or use them as files. Any suggestions? best regards
Re: Opening up one large index takes 940M or memory?
It would be interesting to know _what_exactly_ uses your memory. Running under a profiler should tell you that. The only thing that comes to mind is... I can't remember the details now, but when the index is opened, I believe every 128th term is read into memory. This helps with index seeks at search time. I wonder if this is what's using your memory. The number 128 can't be modified just like that, but somebody (Julien?) has modified the code in the past to make it a variable. That's the only thing I can think of right now, and it may or may not be an idea in the right direction. Otis --- Kevin A. Burton [EMAIL PROTECTED] wrote: We have one large index right now... it's about 60G... When I open it, the Java VM uses 940M of memory. The VM does nothing else besides open this index. Here's the code: System.out.println( "opening..." ); long before = System.currentTimeMillis(); Directory dir = FSDirectory.getDirectory( "/var/ksa/index-1078106952160/", false ); IndexReader ir = IndexReader.open( dir ); System.out.println( ir.getClass() ); long after = System.currentTimeMillis(); System.out.println( "opening...done - duration: " + (after-before) ); System.out.println( "totalMemory: " + Runtime.getRuntime().totalMemory() ); System.out.println( "freeMemory: " + Runtime.getRuntime().freeMemory() ); Is there any way to reduce this footprint? The index is fully optimized... I'm willing to take a performance hit if necessary. Is this documented anywhere? Kevin
Re: Opening up one large index takes 940M or memory?
There Kevin, that's what I was referring to: the .tii file. Otis --- Paul Elschot [EMAIL PROTECTED] wrote: On Saturday 22 January 2005 01:39, Kevin A. Burton wrote: Kevin A. Burton wrote: We have one large index right now... it's about 60G... When I open it, the Java VM uses 940M of memory. The VM does nothing else besides open this index. After thinking about it, I guess 1.5% of memory per index really isn't THAT bad. What would be nice is if there was a way to do this from disk and then use a buffer (either via the filesystem or in-VM memory) to access these variables. It's even documented. From http://jakarta.apache.org/lucene/docs/fileformats.html : "The term info index, or .tii file. This contains every IndexInterval'th entry from the .tis file, along with its location in the .tis file. This is designed to be read entirely into memory and used to provide random access to the .tis file." My guess is that this is what you see happening. To see the actual .tii file, you need the non-default file format. Once searching starts you'll also see that the field norms are loaded; these take one byte per searched field per document. This would be similar to the way the MySQL index cache works... It would be possible to add another level of indexing to the terms. No one has done this yet, so I guess it's preferred to buy RAM instead... Regards, Paul Elschot
Re: Lucene in Action
Hi Ansi, If you want the print version, I would guess you could order it from the publisher (http://www.manning.com/hatcher2) or from Amazon, and they will ship it to you in China. The electronic version (a PDF file) is also available from the above URL. I'll ask Manning Publications whether they ship outside the U.S. Otis --- ansi [EMAIL PROTECTED] wrote: hi, all Does anyone know how to buy Lucene in Action in China? Ansi
Re: Opening up one large index takes 940M or memory?
Yes, I remember your email about the large number of Terms. If it can be avoided and you figure out how to do it, I'd love to patch something. :) Otis --- Kevin A. Burton [EMAIL PROTECTED] wrote: Otis Gospodnetic wrote: It would be interesting to know _what_exactly_ uses your memory. Running under a profiler should tell you that. The only thing that comes to mind is... I can't remember the details now, but when the index is opened, I believe every 128th term is read into memory. This helps with index seeks at search time. I wonder if this is what's using your memory. The number 128 can't be modified just like that, but somebody (Julien?) has modified the code in the past to make it a variable. That's the only thing I can think of right now, and it may or may not be an idea in the right direction. I loaded it into a profiler a long time ago. Most of the memory was due to Term classes being loaded. I might try to get some time to load it into a profiler on Monday... Kevin
Re: Stemming
Hi Kevin, Stemming is an optional operation and is done in the analysis step. Lucene comes with a Porter stemmer and a Filter that you can use in an Analyzer: ./src/java/org/apache/lucene/analysis/PorterStemFilter.java ./src/java/org/apache/lucene/analysis/PorterStemmer.java You can find more about it here: http://www.lucenebook.com/search?query=stemming You can also see mentions of SnowballAnalyzer in those search results, and you can find an adapter for Snowball analyzers in the Lucene Sandbox. Otis --- Kevin L. Cobb [EMAIL PROTECTED] wrote: I want to understand how Lucene uses stemming but can't find any documentation on the Lucene site. I'll continue to google but hope that this list can help narrow my search. I have several questions on the subject currently but hesitate to list them here, since finding a good document on the subject may answer most of them. Thanks in advance for any pointers, Kevin
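A minimal custom Analyzer using that filter, as one plausible arrangement (lowercase, stop words, then Porter stemming), against the Lucene 1.4 API:

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseTokenizer;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;

public class StemmingAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream stream = new LowerCaseTokenizer(reader);
        stream = new StopFilter(stream, StopAnalyzer.ENGLISH_STOP_WORDS);
        return new PorterStemFilter(stream); // "searching" -> "search", etc.
    }
}

As with any analyzer choice, the same StemmingAnalyzer would need to be used at both index and search time.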
RE: Filtering w/ Multiple Terms
This: http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/BooleanQuery.TooManyClauses.html ? You can control that limit via http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/BooleanQuery.html#maxClauseCount Otis --- Jerry Jalenak [EMAIL PROTECTED] wrote: OK. But isn't there a limit on the number of BooleanQueries that can be combined with AND / OR / etc.? Jerry Jalenak Senior Programmer / Analyst, Web Publishing LabOne, Inc. 10101 Renner Blvd. Lenexa, KS 66219 (913) 577-1496 [EMAIL PROTECTED] -----Original Message----- From: Erik Hatcher [mailto:[EMAIL PROTECTED] Sent: Thursday, January 20, 2005 5:05 PM To: Lucene Users List Subject: Re: Filtering w/ Multiple Terms On Jan 20, 2005, at 5:02 PM, Jerry Jalenak wrote: In looking at the examples for filtering of hits, it looks like I can only specify a single term, i.e. Filter f = new QueryFilter(new TermQuery(new Term("acct", "acct1"))); I need to specify more than one term in my filter. Short of using something like ChainFilter, how are others handling this? You can make as complex a Query as you want for QueryFilter. If you want to filter on multiple terms, construct a BooleanQuery with nested TermQuerys, either in an AND or OR fashion. Erik
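Putting the two together - a multi-term QueryFilter plus the clause-count knob - might look like this with the Lucene 1.4 API; the account values are illustrative:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.QueryFilter;
import org.apache.lucene.search.TermQuery;

public class MultiTermFilter {
    public static void main(String[] args) {
        // Raise the limit only if the filter really needs more than the
        // default 1024 clauses.
        BooleanQuery.setMaxClauseCount(4096);

        BooleanQuery accounts = new BooleanQuery();
        // add(query, required, prohibited): false/false = OR semantics
        accounts.add(new TermQuery(new Term("acct", "acct1")), false, false);
        accounts.add(new TermQuery(new Term("acct", "acct2")), false, false);
        accounts.add(new TermQuery(new Term("acct", "acct3")), false, false);

        QueryFilter filter = new QueryFilter(accounts);
        // then: searcher.search(mainQuery, filter)
        System.out.println(accounts);
    }
}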
Re: Suggestion needed for extranet search
Hi Ranjan, It sounds like you should look at and use Nutch: http://www.nutch.org Otis --- Ranjan K. Baisak [EMAIL PROTECTED] wrote: I am planning to move to Lucene but don't have much knowledge of it. The search engine which I have developed searches some extranet URLs, e.g. codeguru.com/index.html. Is it possible to get the same functionality using Lucene, i.e. can I make Lucene a search engine for searching extranets? regards, Ranjan
Re: Concurrent read and write
Hello Ashley, You can read/search while modifying the index, but you have to ensure that only one thread or only one process is modifying an index at any given time. Both IndexReader and IndexWriter can be used to modify an index: the former to delete Documents and the latter to add them. You have to ensure these two operations don't overlap. cf. http://www.lucenebook.com/search?query=concurrent Otis --- Ashley Steigerwalt [EMAIL PROTECTED] wrote: I am a little fuzzy on the thread-safety of Lucene, or maybe just Java. From what I understand, and correct me if I'm wrong, Lucene takes care of concurrency issues and it is OK to run a query while writing to an index. My question is, does this still hold true if the reader and writer are being executed as separate programs? I have a cron job that will update the index periodically. I also have a search application on a web form. Is this going to cause trouble if someone runs a query while the indexer is updating? Ashley
RE: Search Chinese in Unicode !!!
If you are hosting the code somewhere (e.g. your site, SF, java.net, etc.), we should link to it from one of the Lucene pages where we link to related external tools, apps, and such. Otis --- Safarnejad, Ali (AFIS) [EMAIL PROTECTED] wrote: I've written a Chinese Analyzer for Lucene that uses a segmenter written by Erik Peterson. However, as the author of the segmenter does not want his code released under the Apache open source license (although his code _is_ open source), I cannot place my work in the Lucene Sandbox. This is unfortunate, because I believe the analyzer works quite well in indexing and searching Chinese docs in GB2312 and UTF-8 encoding, and I'd like more people to test, use, and confirm this. So anyone who wants it can have it. Just shoot me an email. BTW, I have also written an Arabic analyzer, which is collecting dust for similar reasons. Good luck, Ali Safarnejad -----Original Message----- From: Eric Chow [mailto:[EMAIL PROTECTED] Sent: 21 January 2005 11:42 To: Lucene Users List Subject: Re: Search Chinese in Unicode !!! Search is not really correct with UTF-8!!! The following is the search result when I used the SearchFiles from the Lucene demo. d:\Downloads\Softwares\Apache\Lucene\lucene-1.4.3\src> java org.apache.lucene.demo.SearchFiles c:\temp\myindex Usage: java SearchFiles index Query: 經 Searching for: g strange ?? 3 total matching documents 0. ../docs/ChineseDemo.html - this file contains the 經 1. ../docs/luceneplan.html - Jakarta Lucene - Plan for enhancements to Lucene 2. ../docs/api/index-all.html - Index (Lucene 1.4.3 API) Query: From the above results, only ChineseDemo.html includes the character that I want to search! The modified code in SearchFiles.java: BufferedReader in = new BufferedReader(new InputStreamReader(System.in, "UTF-8"));
RE: help in indexing
Hello Chetan, The code that comes with the Lucene book contains a little framework for indexing rich-text documents. It sounds like you may be able to use it as-is, extending it with a parser for Excel files, which we didn't include in the code (should we include it in the next edition?). While PDFBox comes with that handy Lucene-specific class that you are using, it may be better for you to be in control of how exactly you construct your Lucene documents. c.f. http://www.lucenebook.com/search?query=framework Otis --- chetan minajagi [EMAIL PROTECTED] wrote: Hi Karthik/Cocula, Luke didn't work but Limo helped. I seem to get results when I use Limo for my text/xls files. Now for the problem with PDF search: the problem that I see is that the summary field, as seen through LIMO, is not indexed and hence there are no hits. I'm using the default document returned by LucenePDFDocument.getDocument(myPdfFile); So how do I ensure that the fields in this document which are not indexed are set to indexed? As far as I can see, I can only probe whether a field is indexed or not by using Field.isIndexed(), but is there a method by which I can set a field to indexed? Can someone provide any help or pointers in this regard? Thanks Regards, Chetan Karthik N S [EMAIL PROTECTED] wrote: Hi Probably you need to use the Luke software to peek inside your index. Use it, then come back for more help Karthik -Original Message- From: chetan minajagi [mailto:[EMAIL PROTECTED] Sent: Thursday, January 20, 2005 12:05 PM To: lucene-user@jakarta.apache.org Subject: help in indexing Hi , It might seem elementary to most of you. I am trying to build a search tool for internal use using lucene. I have used the following for .pdf -- PDFBox .html -- demo file of lucene (HTMLDocument) .xls -- poi The indexing seems to work without throwing up any errors. But when I try to search, I always end up with zero hits. I have tried to use the same string that I see (System.out.print(Document)) but in vain. Can somebody let me know where and what could be wrong. Regards, Chetan - Do you Yahoo!? Yahoo! Search presents - Jib Jab's 'Second Term' - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - Do you Yahoo!? Yahoo! Mail - You care about security. So do we. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: lucene2.0 and transaction support
The Wiki has some info about Lucene 2.0, but that is all there is about 2.0. Regarding transactions - have you tried DbDirectory? I believe that will provide XA support and it won't require Lucene changes. Otis --- John Wang [EMAIL PROTECTED] wrote: Hi: When is lucene 2.0 scheduled to be released? Is there a javadoc somewhere so we can check out the new APIs? Is there a plan to add transaction support into lucene? This is something we need and if we do implement it ourselves, is it too large of a change for a patch? Thanks -John - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Closed IndexWriter reuse
No, you can't add documents to an index once you close the IndexWriter. You can re-open the IndexWriter and add more documents, of course. Otis --- Oscar Picasso [EMAIL PROTECTED] wrote: Hi, Is it safe to add documents to an IndexWriter that has been closed? From what I have seen, the close method flush the changes, closes the files but it creates new files allowing to add new documents. Am I right? Thanks. __ Do you Yahoo!? Yahoo! Mail - Easier than ever with enhanced search. Learn more. http://info.mail.yahoo.com/mail_250 - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
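A tiny sketch of that re-open pattern (Lucene 1.4-era API; the path, analyzer, and documents are placeholders; create=false means "append to the existing index"):

    IndexWriter writer = new IndexWriter("/path/to/index", analyzer, true); // create
    writer.addDocument(doc1);
    writer.close(); // flushes changes and closes the files; this instance is done

    writer = new IndexWriter("/path/to/index", analyzer, false); // re-open
    writer.addDocument(doc2); // appending more documents is fine
    writer.close();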
Re: Why IndexReader.lastModified(index) is deprecated?
Going for the segments file like that is not a recommended practise, or at least not something I'd recommend. The 'segments' file is really something that a caller should not know anything about. One day Lucene may choose to rename the segments file or some such, and the code that uses this trick will break. To answer the original question, yes, I think it would be handy to have this method back. Perhaps we should revive it/them, ha? Otis --- Chris Hostetter [EMAIL PROTECTED] wrote: : Why IndexReader.lastModified(index) is deprecated? Did you read the javadocs? Synchronization of IndexReader and IndexWriter instances is no longer done via time stamps of the segments file since the time resolution depends on the hardware platform. Instead, a version number is maintained within the segments file, which is incremented every time the index is changed. : It's always a good idea to know when the index changed last time, for That's a good point, and you can still get that information using the same underlying method IndexReader.lastModified did/does... directory.fileModified("segments"); ...it's just no longer crucial that IndexReader have that information. -Hoss - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
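For what it's worth, a sketch of the public alternative, assuming the Lucene 1.4 API that introduced the version counter Hoss mentions:

    // Reads the version number without the caller touching the
    // 'segments' file itself; the path is a placeholder.
    long version = IndexReader.getCurrentVersion("/path/to/index");
    // Persist the value and compare it later to detect index changes.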
Re: How do I unlock?
I didn't pay full attention to this thread, but it sounds like somebody may be interested in RuntimeShutdownHook (or some similar name) as a place to try to release the locks. Otis --- Joseph Ottinger [EMAIL PROTECTED] wrote: On Tue, 11 Jan 2005, Doug Cutting wrote: Joseph Ottinger wrote: As one for whom the question's come up recently, I'd say that locks need to be terminated gracefully, instead. I've noticed a number of cases where the locks get abandoned in exceptional conditions, which is almost exactly what you don't want. The problem is that this is hard to do from Java. A typical approach is to put the process id in the lock file, then, if that process is dead, ignore the lock file. But Java does not let one know process ids. Java 1.4 provides a LockFile mechanism which should mostly solve this, but Lucene 1.4.3 does not yet require Java 1.4 and hence cannot use that feature. Lucene 2.0 is likely to require Java 1.4 and should be able to do a better job of automatically unlocking indexes when processes die. Agreed - but while there are some situations in which releasing locks is difficult (i.e., JVM catastrophic shutdown), there are others in which attempts could be made via finally blocks, etc. --- Joseph B. Ottinger http://enigmastation.com IT Consultant [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: How do I unlock?
Eh, that exactly :) When I read my emails in reverse order --- Chris Lamprecht [EMAIL PROTECTED] wrote: What about a shutdown hook? Runtime.getRuntime().addShutdownHook(new Thread() { public void run() { /* whatever */ } }); see also http://www.onjava.com/pub/a/onjava/2003/03/26/shutdownhook.html On Tue, 11 Jan 2005 13:21:42 -0800, Doug Cutting [EMAIL PROTECTED] wrote: Joseph Ottinger wrote: As one for whom the question's come up recently, I'd say that locks need to be terminated gracefully, instead. I've noticed a number of cases where the locks get abandoned in exceptional conditions, which is almost exactly what you don't want. The problem is that this is hard to do from Java. A typical approach is to put the process id in the lock file, then, if that process is dead, ignore the lock file. But Java does not let one know process ids. Java 1.4 provides a LockFile mechanism which should mostly solve this, but Lucene 1.4.3 does not yet require Java 1.4 and hence cannot use that feature. Lucene 2.0 is likely to require Java 1.4 and should be able to do a better job of automatically unlocking indexes when processes die. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
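A sketch combining the two ideas, a shutdown hook that releases Lucene's lock files, assuming the Lucene 1.4 statics IndexReader.isLocked()/unlock(); the path is a placeholder, and forcibly unlocking is only safe when no other process can still be writing:

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    final Directory dir = FSDirectory.getDirectory("/path/to/index", false);
    Runtime.getRuntime().addShutdownHook(new Thread() {
        public void run() {
            try {
                if (IndexReader.isLocked(dir)) {
                    IndexReader.unlock(dir); // releases the lock files
                }
            } catch (Exception e) {
                // best effort only; the JVM is on its way down
            }
        }
    });

As Doug notes, no hook fires on a catastrophic JVM death, so this only covers normal termination and most signals.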
Re: Performance question
Use one index; working with a single index is simpler. Also, once you pull a Document from the Hits object, all Fields are read off of the disk. There was some discussion about selective Field reading about a week ago, check the list archives. Also keep in mind Field compression is now possible (only with the unreleased version in CVS). Otis --- Crump, Michael [EMAIL PROTECTED] wrote: Hello, If I have large text fields that are rarely retrieved but need to be searched often - is it better to create 2 indices, one for searching and one for retrieval, or just one index and put everything in it? Or are there other recommendations? Regards, Michael - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Duplicate Id
Hello, If you search for India OR Test, you will find both, if you use AND, you will find none. Lucene can search any text, not just files. It sounds like you are using Lucene's demo as a real application (not a good practise). I suggest you take a look at the Resources page on the Lucene Wiki to get a better idea about what Lucene is and how it can be used. Otis --- mahaveer jain [EMAIL PROTECTED] wrote: Hi, I have a application where I know I will have duplicate ID's. When I search these duplicate ID's will it search content in both the files ? For Example : Id = Mahaveer, Content = Jain India Id = Mahaveer, Content = Lucene Test Now when I search for India Test will it return both the columns ? Also can I display unique results ? Mahaveer __ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
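To make the OR/AND point concrete, a small sketch using QueryParser (Lucene 1.4 API; the "contents" field name is a placeholder):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;

    // Matches both documents: each contains at least one of the terms.
    Query either = QueryParser.parse("India OR Test", "contents", new StandardAnalyzer());
    // Matches neither document: no single document contains both terms.
    Query both = QueryParser.parse("India AND Test", "contents", new StandardAnalyzer());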
Re: reading fields selectively
Hi John, There is no API for this, but I recall somebody talking about adding support for this a few months back. I even think that somebody might have contributed a patch for this. I am not certain about this, but check the patch queue (link on Lucene site). If there is a patch there, even if the patch no longer applies cleanly, you'll be able to borrow the code for your own patch. Also note that the CVS version has support for field compression, which should help with performance if you are working with large fields. Otis --- John Wang [EMAIL PROTECTED] wrote: Hi: Is there some way to read only 1 field value from an index given a docID? From the current API, in order to get a field from given a docID, I would call: IndexSearcher.document(docID) which in turn reads in all fields from the disk. Here is my problem: After the search, I have a set of docIDs. For each document, I have a unique string identifier. At this point I only need these identifiers but with the above API, I am forced to read the entire row of fields for each document in the search result, which in my case can be very large. Is there an alternative? I am thinking more on the lines of a call: Field[] getFields(int docID,String fieldName); Thanks -John - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Lucene Book in UK
The book is $44.95 USD - it's printed on the back cover. Amazon had the correct price (minus their discount) until recently. They are just very slow with their site/book info updates, but I'm sure they'll fix it eventually. Otis --- Erik Hatcher [EMAIL PROTECTED] wrote: On Jan 6, 2005, at 3:49 PM, Chris Hostetter wrote: BN agrees that the list price is $60.95 ... which may be what Manning is citing to resellers. This is incorrect information that has somehow gotten out. Amazon and BN are slow to update their information, but Manning assures me that they have provided the correct information to Amazon to update. The actual price you're paying is certainly not indicative of a $60.95 list price - Amazon doesn't discount 50%, I'm sure. Erik - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: RemoteSearcher
Nutch (nutch.org) has a pretty sophisticated infrastructure for distributed searching, but it doesn't use RemoteSearcher. Otis --- Yura Smolsky [EMAIL PROTECTED] wrote: Hello. Does anyone know application which based on RemoteSearcher to distribute index on many servers? Yura Smolsky, - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Parsing issue
That's the correct place to look and it includes code samples. Yes, it's a Jar file that you add to the CLASSPATH and use ... hm, normally programmatically, yes :). Otis --- Hetan Shah [EMAIL PROTECTED] wrote: Has any one used NekoHTML? If so, how do I use it? Is it a stand alone jar file that I include in my classpath and start using just like IndexHTML? Can someone share syntax and/or code if it is supposed to be used programmatically. I am looking at http://www.apache.org/~andyc/neko/doc/html/ for more information - is that the correct place to look? Thanks, -H Erik Hatcher wrote: Sure... clean up your HTML and it'll parse fine :) Perhaps use JTidy to clean up the HTML. Or switch to using a more forgiving parser like NekoHTML. Erik On Jan 4, 2005, at 3:59 PM, Hetan Shah wrote: Hello All, Does any one know how to handle the following parsing error? Thanks for pointers/code snippets. -H While trying to parse a HTML file using IndexHTML I get Parse Aborted: Encountered \ at line 8, column 1162. Was expecting one of: ArgName ... = ... TagEnd ... - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
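A small sketch of programmatic use, assuming NekoHTML's DOMParser (the file name and the text-collecting helper are placeholders of mine, not part of NekoHTML):

    import java.io.FileInputStream;
    import org.cyberneko.html.parsers.DOMParser;
    import org.w3c.dom.Node;
    import org.xml.sax.InputSource;

    DOMParser parser = new DOMParser();
    parser.parse(new InputSource(new FileInputStream("page.html")));
    String text = textOf(parser.getDocument()); // feed this to a Lucene Field

    // Recursively collects the text nodes of the parsed DOM.
    static String textOf(Node node) {
        if (node.getNodeType() == Node.TEXT_NODE) return node.getNodeValue();
        StringBuffer sb = new StringBuffer();
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling())
            sb.append(textOf(child)).append(' ');
        return sb.toString();
    }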
Re: Help for sorting
Hello, --- mahaveer jain [EMAIL PROTECTED] wrote: I am looking to implement sorting in my lucene application. This is what my code looks like. I am using the StandardAnalyzer() analyzer. Query query = QueryParser.parse(keyword, "contents", analyzer); Sort sortCol = new Sort(new SortField("date")); // date is one of the fields I have indexed. Hits hits = searcher.search(query, sortCol); for (int start = 0; start < hits.length(); start++) { Document doc = hits.doc(start); // get all the data required. } I get this error: no terms in field sdate - cannot determine sort type Is it possible that your 'date' field is empty in some documents you indexed? If so, you should specify your sort field type explicitly. Look at the Javadoc for the SortField class. Can anyone let me know where I am wrong? Also, what is the default sorting in lucene? Default sorting is by rank/score. Also, can someone explain what exactly the score is? Is it something to do with ranking? Does somebody have a link to a good lucene tutorial? There are links to a few Lucene articles on Lucene's Wiki. There is also a link to the Lucene book (Lucene in Action) on the same page. Another good source of information about how to use the Lucene API is Lucene's unit tests. Otis - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
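A sketch of the explicit-type fix suggested above (Lucene 1.4 API; assumes the 'date' field is indexed as an untokenized keyword):

    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.Sort;
    import org.apache.lucene.search.SortField;

    // Spelling out the type avoids the guess Lucene makes from the first
    // term in the field, which fails when some documents leave it empty.
    Sort byDate = new Sort(new SortField("date", SortField.STRING));
    Hits hits = searcher.search(query, byDate);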
Re: how often to optimize?
Correct. The self-maintenance you are referring to is Lucene's periodic segment merging. The frequency of that can be controlled through IndexWriter's mergeFactor. Otis --- aurora [EMAIL PROTECTED] wrote: Are the unoptimized indices causing you any problems (e.g. slow searches, high number of open file handles)? If no, then you don't even need to optimize until those issues become... issues. OK, I have changed the process to not do optimize() at all. So far so good. The number of files hovers from 10 to 40 during the indexing of 10,000 files. It seems Lucene is doing some kind of self-maintenance to keep things in order. Is it right to say optimize() is a totally optional operation? I probably got the impression from the IndexHTML example that it is a natural step to end an incremental update. Since it replicates the whole index, it might be overkill for many applications to do daily. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
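For reference, a sketch of the knobs mentioned here; in Lucene 1.4 these are public fields on IndexWriter (path and analyzer are placeholders):

    IndexWriter writer = new IndexWriter("/path/to/index", analyzer, false);
    // Higher mergeFactor = fewer merges while indexing (faster, but more
    // segment files on disk); lower = more aggressive merging.
    writer.mergeFactor = 20;   // default is 10
    writer.minMergeDocs = 100; // documents buffered in memory before a segment is written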
Re: Need an analyzer that includes numbers.
WhitespaceAnalyzer will let you have it. It just breaks the input on spaces. Otis --- Jim [EMAIL PROTECTED] wrote: I've seen some discussion on this and the answer seems to be write your own. Hasn't someone already done that by now that would share? I really have to be able to include numeric and alphanumeric strings in my searches. I don't understand analyzers well enough to roll my own. Thanks, Jim. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Unable to read TLD META-INF/c.tld from JAR file ... standard.jar
Most definitely Jetty. I can't believe you're using Tomcat for Rojo! ;) Otis --- Erik Hatcher [EMAIL PROTECTED] wrote: Wrong list. Though perhaps you should be using Jetty ;) Erik On Dec 23, 2004, at 4:17 PM, Kevin A. Burton wrote: What in the world is up with this exception? We've migrated to using pre-compiled JSPs in Tomcat 5.5 for performance reasons but if I try to start with a FRESH webapp or try to update any of the JSPs and in-place and recompile I'll get this error: Any idea? I thought maybe the .jar files were corrupt but if I md5sum them they are identical to production and the Tomcat standard dist. Thoughts? org.apache.jasper.JasperException: /subscriptions/index.jsp(1,1) /init.jsp(2,0) Unable to read TLD META-INF/c.tld from JAR file file:/usr/local/jakarta-tomcat-5.5.4/webapps/rojo/ROOT/WEB-INF/lib/ standard.jar: org.apache.jasper.JasperException: Failed to load or instantiate TagLibraryValidator class: org.apache.taglibs.standard.tlv.JstlCoreTLV org.apache.jasper.compiler.DefaultErrorHandler.jspError(DefaultErrorHan dler.java:39) org.apache.jasper.compiler.ErrorDispatcher.dispatch(ErrorDispatcher.jav a:405) org.apache.jasper.compiler.ErrorDispatcher.jspError(ErrorDispatcher.jav a:86) org.apache.jasper.compiler.Parser.processIncludeDirective(Parser.java: 339) org.apache.jasper.compiler.Parser.parseIncludeDirective(Parser.java: 372) org.apache.jasper.compiler.Parser.parseDirective(Parser.java:475) org.apache.jasper.compiler.Parser.parseElements(Parser.java:1539) org.apache.jasper.compiler.Parser.parse(Parser.java:126) org.apache.jasper.compiler.ParserController.doParse(ParserController.ja va:211) org.apache.jasper.compiler.ParserController.parse(ParserController.java :100) org.apache.jasper.compiler.Compiler.generateJava(Compiler.java:146) org.apache.jasper.compiler.Compiler.compile(Compiler.java:286) org.apache.jasper.compiler.Compiler.compile(Compiler.java:267) org.apache.jasper.compiler.Compiler.compile(Compiler.java:255) org.apache.jasper.JspCompilationContext.compile(JspCompilationContext.j ava:556) org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.j ava:296) org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java: 295) org.apache.jasper.servlet.JspServlet.service(JspServlet.java:245) javax.servlet.http.HttpServlet.service(HttpServlet.java:802) -- Use Rojo (RSS/Atom aggregator). Visit http://rojo.com. Ask me for an invite! Also see irc.freenode.net #rojo if you want to chat. Rojo is Hiring! - http://www.rojonetworks.com/JobsAtRojo.html If you're interested in RSS, Weblogs, Social Networking, etc... then you should work for Rojo! If you recommend someone and we hire them you'll get a free iPod! Kevin A. Burton, Location - San Francisco, CA AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: retrieve tokens
Martijn, have you seen the Highlighter in the Lucene Sandbox? If you've stored your text in the Lucene index, there is no need to go back to the DB to pull out the blob, parse it, and highlight it - the Highlighter in the Sandbox will do this for you. Otis --- M. Smit [EMAIL PROTECTED] wrote: Hello list, I'm not sure if this subject will cover my question, but here goes: consider the following snippet: is = new IndexSearcher((String) envContext.lookup(search_index_dir)); StopAnalyzer analyzer = new StopAnalyzer(ArticleIndexer.SEARCH_STOP_WORDS_NL); parser = new NewMultiFieldQueryParser(ArticleIndexer.FIELDS_SEARCH_BASIC, analyzer); parser.setOperator(QueryParser.DEFAULT_OPERATOR_AND); query = parser.parse(searchForm.getCriteria()); hits = is.search(query); log.info("[execute] aantal Lucene hits: " + hits.length()); Perfect. And when I present the results, I retrieve the original document from the database through its guid, which I get from doc.get(ArticleIndexer.FIELD_GUID). And besides some business logic I have to take care of when I retrieve the original document, I would also like to give a context snippet. So I've written a class which takes care of this context 'snippeting and highlighting' (perhaps somebody knows about a great project which I haven't found last week while hunting for it). But I need to have the original query.. And preferably the words associated with the fields in (String[]) ArticleIndexer.FIELDS_SEARCH_BASIC. Because every field corresponds with a different text blob in my DB, I have to know which BufferedReader I have to parse for the associated words.. Thank you for your time, Martijn - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
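A sketch of the sandbox Highlighter in use, assuming its late-2004 API; the field name and the text pulled from storage are placeholders:

    import java.io.StringReader;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.search.highlight.Highlighter;
    import org.apache.lucene.search.highlight.QueryScorer;

    Highlighter highlighter = new Highlighter(new QueryScorer(query));
    TokenStream tokens = analyzer.tokenStream("contents", new StringReader(blobText));
    String snippet = highlighter.getBestFragment(tokens, blobText); // KWIC-style fragment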
Re: (Offtopic) The unicode name for a character
If you are not tied to Java, see 'unac' at http://www.senga.org/. It's old, but if nothing else you could see how it works and rewrite it in Java. And if you do, you can donate it to the Lucene Sandbox. Otis --- Peter Pimley [EMAIL PROTECTED] wrote: Hi everyone, The Question: In Java generally, is there an easy way to get the Unicode name of a character? (e.g. LATIN SMALL LETTER A from 'a') The Reasoning (for those who are interested): The documents I'm indexing have quite a lot of characters that are basically variations on the basic A-Z ones. In my analysis step, I'd like to convert these to their closest equivalent in the basic A-Z set. For some letters, this is easy. An example is the e-acute character (00E9 LATIN SMALL LETTER E WITH ACUTE). I'd like to turn that into plain 'e'. I can do that by using the IBM ICU4J tools to decompose the single character into two; 'e' and 0301 COMBINING ACUTE ACCENT. Then I can strip all characters that fail Character.isLetterOrDigit. That works fine. Some characters however do not decompose. An example is the character 01A4 LATIN CAPITAL LETTER P WITH HOOK. I'd like to replace that with 'P', but it does not decompose into P + something. I'm considering taking the Unicode name for each character I encounter and regexping it against something like: ^LATIN .* LETTER (.) WITH .*$ ... to try and extract the single A-Z|a-z character. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
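A sketch of the decompose-and-strip idea in plain Java; note that java.text.Normalizer only appeared in JDK 6, so on the JDKs current in this thread ICU4J's Normalizer plays the same role:

    import java.text.Normalizer;

    // NFD splits 'é' into 'e' + U+0301, and the regex then strips the
    // combining marks, leaving plain "cafe".
    String folded = Normalizer.normalize("café", Normalizer.Form.NFD)
                              .replaceAll("\\p{M}+", "");
    // Characters such as U+01A4 (LATIN CAPITAL LETTER P WITH HOOK) have no
    // decomposition and survive untouched; that is exactly the gap Peter
    // describes, which still needs a lookup table or a name-based hack.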
Re: retrieve tokens
I suspect Martijn really wants that snippet dynamically generated, with KWIC, as on the lucenebook.com screen shot. Thus, he can't generate and store the snippet at index time, and has to construct it at search time. Otis --- Mike Snare [EMAIL PROTECTED] wrote: But for the other issue on 'store lucene' vs 'store db'. Does anyone can provide me with some field experience on size? The system I'm developing will provide searching through some 2000 pdf's, say some 200 pages each. I feed the plain text into Lucene on a Field.UnStored bases. I also store this plain text in the database for the sole purpose of presenting a context snippet. Why not store the snippet in another field that is stored, but not indexed? You could then immediately retrieve the snippet from the doc. This would only increase your index by num_docs * size_snippet and would save the db access time and complexity. -Mike - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: retrieve tokens
For simpy.com I store the full text of web pages in Lucene, in order to provide full-text web searches. Nutch (nutch.org) does the same. You can set the maximal number of tokens you want indexed via IndexWriter. You can also compress fields in the newest version of Lucene (or maybe just the one in CVS), which may help you if you are concerned about disk space, although I wouldn't want to have to uncompress each hit's 200 pages worth of text in order to create a summary with KWIC. :) Oh, and you asked about the highlighter and field/query matching. I _think_ it won't help you with that, but I'm a bit behind on the highlighter, so you should check the version in CVS and see if it's capable of this. Otis --- M. Smit [EMAIL PROTECTED] wrote: Erik Hatcher wrote: Highlighter does not mandate you store your text in the index. It is just a convenient way to do it. You're free to pull the text from anywhere and highlight it based on the query. Furthermore, you are saying that the highlighter takes care of the corresponding field/words for me and pulls up a context snippet? Ouch, why haven't I stumbled upon the sandbox See a screenshot of it here: http://www.lucenebook.com (going live within a week!) Oh bliss, oh joy.. This is exactly what I'm looking for... I'll plunge into it and let you know! But for the other issue of 'store in Lucene' vs 'store in db': can anyone provide me with some field experience on size? The system I'm developing will provide searching through some 2000 pdf's, say some 200 pages each. I feed the plain text into Lucene on a Field.UnStored basis. I also store this plain text in the database for the sole purpose of presenting a context snippet. If I were to use the Highlighter with a Field.Text, I would not need the database plain-text part at all. But still I'm a little worried about speed/space issues. Or am I just seeing bears-on-the-road (Dutch saying, in plain English: making a fuss about nothing).. Martijn - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: addIndexes() Question
I _think_ you'd be better off doing it all at once, but I wouldn't trust myself on this and would instead construct a small 3-index set and test, looking at a) maximal disk usage, b) time, and c) RAM usage. :) Otis --- Ryan Aslett [EMAIL PROTECTED] wrote: Hi there, I'm about to embark on a Lucene project of massive scale (between 500 million and 2 billion documents). I am currently working on parallelizing the construction of the index(es). Rough summary of my plan: I have many, many physical machines, each with multiple processors, that I wish to dedicate to the construction of a single index. I plan on having each machine gather its documents from a central synchronized source (network, JMS, whatever). Within each machine I will have multiple threads, each responsible for constructing an index slice. When all machines and all threads are finished, I should have a slew of index slices that I want to combine to create one index. My question is this: Will it be more efficient to call addIndexes(Directory[] dirs) on all the slices at once? Or might it be better to continually merge small indexes into a larger index, i.e. once an index slice reaches a particular size, merge it into the main index and start building a new slice... Any help would be appreciated.. Ryan Aslett - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
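A sketch of the all-at-once merge (Lucene 1.4 API; paths are placeholders):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    IndexWriter writer = new IndexWriter("/indexes/main", new StandardAnalyzer(), true);
    Directory[] slices = new Directory[] {
        FSDirectory.getDirectory("/indexes/slice0", false),
        FSDirectory.getDirectory("/indexes/slice1", false),
        // ... one entry per finished slice
    };
    writer.addIndexes(slices); // merges and optimizes in a single pass
    writer.close();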
Re: index size doubled?
Another possibility is that you are using an older version of Lucene, which was known to have a bug with similar symptoms. Get the latest version of Lucene. You shouldn't really have multiple .cfs files after optimizing your index. Also, optimize only at the end, if you care about indexing speed. Otis --- Paul Elschot [EMAIL PROTECTED] wrote: On Tuesday 21 December 2004 05:49, aurora wrote: I'm testing the rebuilding of the index. I add several hundred documents, optimize and add another few hundred, and so on. Right now I have around 7000 files. I observed that after the index gets to a certain size, every time after optimize there are two files of roughly the same size, like below: 12/20/2004 01:57p 13 deletable 12/20/2004 01:57p 29 segments 12/20/2004 01:53p 14,460,367 _5qf.cfs 12/20/2004 01:57p 15,069,013 _5zr.cfs The total index size is double what I expect. This is not always reproducible. (I'm constantly tuning my program and the set of documents.) Sometimes I do get a single file after optimize. What was happening? Lucene tried to delete the older version (_5qf.cfs above), but got an error back from the file system. After that it has put the name of that segment in the deletable file, so it can try later to delete that segment. This is known behaviour on FAT file systems, which randomly take some time to finish closing a file after it has been correctly closed by a program. Regards, Paul Elschot - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: how often to optimize?
Hello, I think some of these questions may be answered in the jGuru FAQ. So my question is: would it be overkill to optimize every day? Only if lots of documents are being added/deleted, and you end up with a lot of index segments. Is there any guideline on how often to optimize? Every 1000 documents or more? Are the unoptimized indices causing you any problems (e.g. slow searches, high number of open file handles)? If no, then you don't even need to optimize until those issues become... issues. Every week? Is there any concern if there are a lot of documents added without optimizing? Possibly, see my answer above. Otis - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: analyzer affecting phrases?
When searching for phrases, what's important is the position of each token/word extracted by the Analyzer. WhitespaceAnalyzer/LowerCaseFilter don't do anything with the positional information. There is nothing else in your Analyzer? In any case, the following should help you see what your Analyzer is doing: http://wiki.apache.org/jakarta-lucene/AnalysisParalysis and you can augment the code there to provide positional information, too. Otis --- Peter Posselt Vestergaard [EMAIL PROTECTED] wrote: Hi I am building an index of texts, each related to a unique id. The unique ids might contain a number of underscores which will make the standardanalyzer shorten them after it sees the second underscore in a row. Furthermore many of the texts I am indexing is in Italian so the removal of 'trivial' words done by the standard analyzer is not necessarily meaningful for these texts. Therefore I am instead using an analyzer made from the WhitespaceTokenizer and the LowerCaseFilter. This works fine for me until I try searching for a phrase. I am searching for a simple phrase containing two words and with double-quotes around it. I have found the phrase in one of the texts so I know it should return at least one result, but none is found. If I remove the double-quotes and searches for the 2 words with AND between them I do find the story. Can anyone tell me if this is an obvious (side-)effect of not using the standard analyzer? And is there a better solution to my problem than using the very simple analyzer? Best regards Peter Vestergaard PS: I use the same analyzer for both searching and indexing (of course). - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
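An augmented version of the AnalysisParalysis recipe that also prints positional information (Lucene 1.4 API; the analyzer and text are placeholders):

    import java.io.StringReader;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;

    TokenStream stream = analyzer.tokenStream("contents", new StringReader(text));
    for (Token t = stream.next(); t != null; t = stream.next()) {
        // A position increment other than 1 is what breaks phrase queries.
        System.out.println(t.termText() + " +" + t.getPositionIncrement());
    }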
RE: Queries difference
Alex, I think you want this: +city:London +city:Amsterdam +address:1_street +address:2_street Otis --- Alex Kiselevski [EMAIL PROTECTED] wrote: Thanks Morus. So if I understand right, if the second query is: +city(London) +city(Amsterdam) +address(1_street) +address(2_street) both queries have the same value? -Original Message- From: Morus Walter [mailto:[EMAIL PROTECTED] Sent: Monday, December 20, 2004 6:11 PM To: Lucene Users List Subject: Re: Queries difference Alex Kiselevski writes: Hello, I want to know is there a difference between queries: +city(+London Amsterdam) +address(1_street 2_street) And +city(+London) +city(Amsterdam) +address(1_street) +address(2_street) I guess you mean city:(... and so on. The first query searches documents containing 'London' in city, scoring results also containing Amsterdam higher, and containing 1_street or 2_street in address. The second query searches for documents containing both London and Amsterdam in city and 1_street and 2_street in address. Note that the + before London in the second query doesn't mean anything. HTH Morus - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] The information contained in this message is proprietary of Amdocs, protected from disclosure, and may be privileged. The information is intended to be conveyed only to the designated recipient(s) of the message. If the reader of this message is not the intended recipient, you are hereby notified that any dissemination, use, distribution or copying of this communication is strictly prohibited and may be unlawful. If you have received this communication in error, please notify us immediately by replying to the message and deleting it from your computer. Thank you. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Indexing with Lucene 1.4.3
The only place where you have to specify that you are using the compound index format is on IndexWriter instance. Nothing needs to be done at search time on IndexSearcher. Otis --- Hetan Shah [EMAIL PROTECTED] wrote: Thanks Chuck, I now understand why I see only one file. Another question is do I have to specify somewhere in my code or some configuration setting that I would now be using a compound file format (.cfs file) for index. I have an application that was working in version 1.3-final till I moved to 1.4.3 now I do not get any results back from my searches. I tried using Luke and it shows me the content of the index. I can search using Luke but no success so far with my own application. Any pointers? Thanks. -H Chuck Williams wrote: That looks right to me, assuming you have done an optimize. All of your index segments are merged into the one .cfs file (which is large, right?). Try searching -- it should work. Chuck -Original Message- From: Hetan Shah [mailto:[EMAIL PROTECTED] Sent: Thursday, December 16, 2004 11:00 AM To: Lucene Users List Subject: Indexing with Lucene 1.4.3 Hello, I have been trying to index around 6000 documents using IndexHTML from 1.4.3 and at the end of indexing in my index directory I only have 3 files. segments deletable and _5en.cfs Can someone tell me what is going on and where are the actual index files? How can I resolve this issue? Thanks. -H - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
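For completeness, a one-line sketch (assuming the Lucene 1.4 API, where compound format is already the default; the path and analyzer are placeholders):

    IndexWriter writer = new IndexWriter("/path/to/index", analyzer, false);
    writer.setUseCompoundFile(true); // write .cfs segments; searchers need no setting
    // ... add documents, then writer.close()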
Re: Disk space needed for indexing???
The exact disk space usage depends on the number of fields in the index and on how many of them store the original text. You should also keep in mind that the call to IndexWriter's optimize() will result in your index directory size doubling while the optimization is in progress, so if you want to optimize you will need extra free disk space. Otis --- [EMAIL PROTECTED] wrote: Hi, everyone, Does anyone have any idea how much disk space will be needed for generating the final index with ~1.5G size, for example? I have ~3.5G disk space and is able to generate index with 1G size. However, after I add more records, it will run out of disk space. Does Lucene suppose to take so much disk space for indexing? Is there any way that I can improve the code to let it take less space? Thanks, Ying - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Why does the StandardTokenizer split hyphenated words?
Hello, As Erik already said - that Analyzer is really there to get people going quickly and as a 'does pretty good' Analyzer. There is no Analyzer that will work for everyone, and Analyzers are meant to be custom-made. It looks like you already got that figured out and have your own Analyzer. Otis --- Mike Snare [EMAIL PROTECTED] wrote: Absolutely, but -- correct me if I'm wrong -- it would give no higher ranking to half-baked and would take a good deal longer on large indices. On Thu, 16 Dec 2004 20:03:27 +0100, Daniel Naber [EMAIL PROTECTED] wrote: On Thursday 16 December 2004 13:46, Mike Snare wrote: Maybe for a-b, but what about English words like half-baked? Perhaps that's the difference in thinking, then. I would imagine that you would want to search on half-baked and not half AND baked. A search for half-baked will find both half-baked and half baked (the phrase). The only thing you'll not find is halfbaked. Regards Daniel -- http://www.danielnaber.de - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Indexing a large number of DB records
Hello Homam, The batches I was referring to were batches of DB rows. Instead of SELECT * FROM table... do SELECT * FROM table ... OFFSET=X LIMIT=Y. Don't close the IndexWriter - use the single instance. There is no MakeStable()-like method in Lucene, but you can control the number of in-memory Documents, the frequency of segment merges, and the maximal size of index segments with 3 IndexWriter parameters, described fairly verbosely in the javadocs. Since you are using the .Net version, you should really consult the dotLucene guy(s). Running under a profiler should also tell you where the time and memory go. Otis --- Homam S.A. [EMAIL PROTECTED] wrote: Thanks Otis! What do you mean by building it in batches? Does it mean I should close the IndexWriter every 1000 rows and reopen it? Does that release references to the document objects so that they can be garbage-collected? I'm calling optimize() only at the end. I agree that 1500 documents is very small. I'm building the index on a PC with 512 megs, and the indexing process is quickly gobbling up around 400 megs when I index around 1800 documents, and the whole machine is grinding to a virtual halt. I'm using the latest dotLucene .NET port, so maybe there's a memory leak in it. I have experience with AltaVista search (acquired by FastSearch), and I used to call MakeStable() every 20,000 documents to flush memory structures to disk. There doesn't seem to be an equivalent in Lucene. -- Homam --- Otis Gospodnetic [EMAIL PROTECTED] wrote: Hello, There are a few things you can do: 1) Don't just pull all rows from the DB at once. Do that in batches. 2) If you can get a Reader from your SqlDataReader, consider this: http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/document/Field.html#Text(java.lang.String,%20java.io.Reader) 3) Give the JVM more memory to play with by using -Xms and -Xmx JVM parameters 4) See IndexWriter's minMergeDocs parameter. 5) Are you calling optimize() at some point by any chance? Leave that call for the end. 1500 documents with 30 columns of short String/number values is not a lot. You may be doing something else, not Lucene related, that's slowing things down. Otis --- Homam S.A. [EMAIL PROTECTED] wrote: I'm trying to index a large number of records from the DB (a few millions). Each record will be stored as a document with about 30 fields, most of them are UnStored and represent small strings or numbers. No huge DB Text fields. But I'm running out of memory very fast, and the indexing is slowing down to a crawl once I hit around 1500 records. The problem is each document is holding references to the string objects returned from ToString() on the DB field, and the IndexWriter is holding references to all these document objects in memory, so the garbage collector isn't getting a chance to clean these up. How do you guys go about indexing a large DB table? Here's a snippet of my code (this method is called for each record in the DB): private void IndexRow(SqlDataReader rdr, IndexWriter iw) { Document doc = new Document(); for (int i = 0; i < BrowseFieldNames.Length; i++) { doc.Add(Field.UnStored(BrowseFieldNames[i], rdr.GetValue(i).ToString())); } iw.AddDocument(doc); } __ Do you Yahoo!? Yahoo! Mail - Find what you need with new enhanced search. http://info.mail.yahoo.com/mail_250 - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] __ Do you Yahoo!? Take Yahoo! Mail with you!
Get it on your mobile phone. http://mobile.yahoo.com/maildemo - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
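A sketch of the batched approach in Java/JDBC terms (Lucene 1.4 API; the LIMIT/OFFSET dialect, table, columns, and connection are placeholders, and the .NET port mirrors the same calls):

    import java.sql.ResultSet;
    import java.sql.Statement;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), true);
    Statement stmt = connection.createStatement(); // connection is a placeholder
    int batchSize = 1000;
    for (int offset = 0; ; offset += batchSize) {
        ResultSet rs = stmt.executeQuery("SELECT id, body FROM docs LIMIT "
                                         + batchSize + " OFFSET " + offset);
        int rows = 0;
        while (rs.next()) {
            rows++;
            Document doc = new Document();
            doc.add(Field.Keyword("id", rs.getString("id")));      // stored, untokenized
            doc.add(Field.UnStored("body", rs.getString("body"))); // indexed only
            writer.addDocument(doc); // nothing else keeps a reference to doc
        }
        rs.close();
        if (rows == 0) break; // ran out of rows
    }
    writer.optimize(); // once, at the very end
    writer.close();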
RE: Indexing a large number of DB records
Note that this recipe really includes some extra steps. You don't need a temp index. Add everything to a single index using a single IndexWriter instance. No need to call addIndexes nor optimize until the end. Adding Documents to an index takes a constant amount of time, regardless of the index size, because new segments are created as documents are added, and existing segments don't need to be updated (only when merges happen). Again, I'd run your app under a profiler to see where the time and memory are going. Otis --- Garrett Heaver [EMAIL PROTECTED] wrote: Hi Homam, I had a similar problem to yours, in that I was indexing A LOT of data. Essentially how I got round it was to batch the index. What I was doing was to add 10,000 documents to a temporary index, use addIndexes() to merge the temporary index into the live index (which also optimizes the live index), then delete the temporary index. On the next loop I'd only query rows from the db above the id in the maxdoc of the live index and set the max rows of the query to 10,000, i.e. SELECT TOP 10000 [fields] FROM [tables] WHERE [id_field] > {ID from Index.MaxDoc()} ORDER BY [id_field] ASC Ensuring that the documents go into the index sequentially, your problem is solved, and memory usage on mine (dotLucene 1.3) is low Regards Garrett -Original Message- From: Homam S.A. [mailto:[EMAIL PROTECTED] Sent: 15 December 2004 02:43 To: Lucene Users List Subject: Indexing a large number of DB records I'm trying to index a large number of records from the DB (a few millions). Each record will be stored as a document with about 30 fields, most of them are UnStored and represent small strings or numbers. No huge DB Text fields. But I'm running out of memory very fast, and the indexing is slowing down to a crawl once I hit around 1500 records. The problem is each document is holding references to the string objects returned from ToString() on the DB field, and the IndexWriter is holding references to all these document objects in memory, so the garbage collector isn't getting a chance to clean these up. How do you guys go about indexing a large DB table? Here's a snippet of my code (this method is called for each record in the DB): private void IndexRow(SqlDataReader rdr, IndexWriter iw) { Document doc = new Document(); for (int i = 0; i < BrowseFieldNames.Length; i++) { doc.Add(Field.UnStored(BrowseFieldNames[i], rdr.GetValue(i).ToString())); } iw.AddDocument(doc); } __ Do you Yahoo!? Yahoo! Mail - Find what you need with new enhanced search. http://info.mail.yahoo.com/mail_250 - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: A question about scoring function in Lucene
There is one case that I can think of where this 'constant' scoring would be useful, and I think Chuck already mentioned this 1-2 months ago. For instance, having such scores would allow one to create alert applications where queries run by some scheduler would trigger an alert whenever the score exceeds some threshold X. So that is where the absolute value of the score would be useful. I believe Chuck submitted some code that fixes this, which also helps with MultiSearcher, where you have to have this constant score in order to properly order hits from different Searchers, but I didn't dare to touch that code without studying it further, for which I didn't have time. Otis --- Doug Cutting [EMAIL PROTECTED] wrote: Chuck Williams wrote: I believe the biggest problem with Lucene's approach relative to the pure vector space model is that Lucene does not properly normalize. The pure vector space model implements a cosine in the strictly positive sector of the coordinate space. This is guaranteed intrinsically to be between 0 and 1, and produces scores that can be compared across distinct queries (i.e., 0.8 means something about the result quality independent of the query). I question whether such scores are more meaningful. Yes, such scores would be guaranteed to be between zero and one, but would 0.8 really be meaningful? I don't think so. Do you have pointers to research which demonstrates this? E.g., that when such a scoring method is used, thresholding by score is useful across queries? Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: finalize delete without optimize
Hello John, Once you make your change locally, use 'cvs diff -u IndexWriter.java > indexwriter.patch' to make a patch. Then open a new Bugzilla entry. Finally, attach your patch to that entry. Note that Document deletion is actually done from IndexReader, so your patch may have to be on IndexReader, not IndexWriter. Thanks, Otis --- John Wang [EMAIL PROTECTED] wrote: Hi Otis: Thanks for your reply. I am looking for more of an API call than a tool, e.g. IndexWriter.finalizeDelete() If I implement this, how would I go about submitting a patch? thanks -John On Mon, 13 Dec 2004 22:24:12 -0800 (PST), Otis Gospodnetic [EMAIL PROTECTED] wrote: Hello John, I believe you didn't get any replies to this. What you are describing cannot be done using the public API, but maaay (no source code on this machine, so I can't double-check that) be doable if you use some of the 'internal' methods. I don't have the need for this, but others might, so it may be worth developing a tool that purges Documents marked as deleted without the expensive segment merging, iff that is possible. If you put this tool under the appropriate org.apache.lucene... package, you'll get access to 'internal' methods, of course. If you end up creating this, we could stick it in the Sandbox, where we should really create a new section for handy command-line tools that manipulate the index. Otis --- John Wang [EMAIL PROTECTED] wrote: Hi: Is there a way to finalize delete, e.g. actually remove them from the segments and make sure the docIDs are contiguous again. The only explicit way to do this is by calling IndexWriter.optimize(). But this call does a lot more (also merges all the segments), hence is very expensive. Is there a way to simply just finalize the deletes without having to merge all the segments? If not, I'd be glad to submit an implementation of this feature if the Lucene devs agree this is useful. Thanks -John - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: TFIDF Implementation
You can also see the 'Books like this' example from here: https://secure.manning.com/catalog/view.php?book=hatcher2&item=source Otis --- Bruce Ritchie [EMAIL PROTECTED] wrote: Christoph, I'm not entirely certain if this is what you want, but a while back David Spencer did code up a 'More Like This' class which can be used for generating similarities between documents. I can't seem to find this class in the sandbox so I've attached it here. Just repackage and test. Regards, Bruce Ritchie http://www.jivesoftware.com/ -Original Message- From: Christoph Kiefer [mailto:[EMAIL PROTECTED] Sent: December 14, 2004 11:45 AM To: Lucene Users List Subject: TFIDF Implementation Hi, My current task/problem is the following: I need to implement TFIDF document term ranking using Jakarta Lucene to compute a similarity rank between arbitrary documents in the constructed index. I saw from the API that there are similar functions already implemented in the classes Similarity and DefaultSimilarity, but I don't know exactly how to use them. At this time my index has about 25000 (small) documents and there are about 75000 terms stored in total. Now, my question is simple. Has anybody done this before, or could anybody point me to another location for help? Thanks for any help in advance. Christoph -- Christoph Kiefer Department of Informatics, University of Zurich Office: Uni Irchel 27-K-32 Phone: +41 (0) 44 / 635 67 26 Email: [EMAIL PROTECTED] Web: http://www.ifi.unizh.ch/ddis/christophkiefer.0.html - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Opinions: Using Lucene as a thin database
Well, one could always partition an index, distribute pieces of it horizontally across multiple 'search servers' and use the built-in RMI-based and Parallel search feature. Nutch uses something similar for search scaling. Otis --- Monsur Hossain [EMAIL PROTECTED] wrote: My concern is that this just shifts the scaling issue to Lucene, and I haven't found much info on how to scale Lucene vertically. By vertically, of course, I meant horizontally. Basically scaling it across servers as one might do with a relational database. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: finalize delete without optimize
Hello John, I believe you didn't get any replies to this. What you are describing cannot be done using the public API, but maaay (no source code on this machine, so I can't double-check that) be doable if you use some of the 'internal' methods. I don't have the need for this, but others might, so it may be worth developing a tool that purges Documents marked as deleted without the expensive segment merging, iff that is possible. If you put this tool under the appropriate org.apache.lucene... package, you'll get access to 'internal' methods, of course. If you end up creating this, we could stick it in the Sandbox, where we should really create a new section for handy command-line tools that manipulate the index. Otis --- John Wang [EMAIL PROTECTED] wrote: Hi: Is there a way to finalize delete, e.g. actually remove them from the segments and make sure the docIDs are contiguous again. The only explicit way to do this is by calling IndexWriter.optimize(). But this call does a lot more (also merges all the segments), hence is very expensive. Is there a way to simply just finalize the deletes without having to merge all the segments? If not, I'd be glad to submit an implementation of this feature if the Lucene devs agree this is useful. Thanks -John - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Opinions: Using Lucene as a thin database
You can see a Flickr-like tag (lookup) system at my Simpy site ( http://www.simpy.com ). It uses Lucene as the backend for lookups, but still uses an RDBMS as the primary storage. I find that keeping the RDBMS and Lucene indices in sync is a bit of a pain and error prone, so a _thin_ storage layer with simple requirements will be okay with just Lucene, while applications with more complex domain models will quickly run into limitations (a 'wrong tool for the job' type of problem). Otis --- Monsur Hossain [EMAIL PROTECTED] wrote: I think this is a great idea, and one that I've been mulling over to implement keyword lookups (similar to Flickr.com's tag system). I believe the advantage over a relational database comes from Lucene's inverted index, which is highly optimized for this kind of lookup. My concern is that this just shifts the scaling issue to Lucene, and I haven't found much info on how to scale Lucene vertically. -Original Message- From: Kevin L. Cobb [mailto:[EMAIL PROTECTED] Sent: Tuesday, December 14, 2004 9:40 AM To: [EMAIL PROTECTED] Subject: Opinions: Using Lucene as a thin database I use Lucene as a legitimate search engine, which is cool. But I am also using it as a simple database too. I build an index with a couple of keyword fields that allows me to retrieve values based on exact matches in those fields. This is all I need to do, so it works just fine for my needs. I also love the speed. The index is small enough that it is wicked fast. Was wondering if anyone out there was doing the same, or if there are any dissenting opinions on using Lucene for this purpose. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Indexing a large number of DB records
Hello, There are a few things you can do: 1) Don't just pull all rows from the DB at once. Do that in batches. 2) If you can get a Reader from your SqlDataReader, consider this: http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/document/Field.html#Text(java.lang.String,%20java.io.Reader) 3) Give the JVM more memory to play with by using -Xms and -Xmx JVM parameters 4) See IndexWriter's minMergeDocs parameter. 5) Are you calling optimize() at some point by any chance? Leave that call for the end. 1500 documents with 30 columns of short String/number values is not a lot. You may be doing something else, not Lucene related, that's slowing things down. Otis --- Homam S.A. [EMAIL PROTECTED] wrote: I'm trying to index a large number of records from the DB (a few millions). Each record will be stored as a document with about 30 fields, most of them are UnStored and represent small strings or numbers. No huge DB Text fields. But I'm running out of memory very fast, and the indexing is slowing down to a crawl once I hit around 1500 records. The problem is each document is holding references to the string objects returned from ToString() on the DB field, and the IndexWriter is holding references to all these document objects in memory, so the garbage collector isn't getting a chance to clean these up. How do you guys go about indexing a large DB table? Here's a snippet of my code (this method is called for each record in the DB): private void IndexRow(SqlDataReader rdr, IndexWriter iw) { Document doc = new Document(); for (int i = 0; i < BrowseFieldNames.Length; i++) { doc.Add(Field.UnStored(BrowseFieldNames[i], rdr.GetValue(i).ToString())); } iw.AddDocument(doc); } __ Do you Yahoo!? Yahoo! Mail - Find what you need with new enhanced search. http://info.mail.yahoo.com/mail_250 - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
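A fragment illustrating tip 2 in Java/JDBC terms (getCharacterStream is standard JDBC; Field.Text(String, Reader) is the Lucene 1.4 signature; the column name is a placeholder):

    // Streams the column into the index instead of materializing a String.
    java.io.Reader body = rs.getCharacterStream("body");
    doc.add(Field.Text("body", body)); // indexed and tokenized, but not stored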
Re: Indexing HTML files gives the following message
Hello, This is probably due to some bad HTML. The application you are using is just a demo, and uses a JavaCC-based HTML parser, which may not be resilient to invalid HTML. For Lucene in Action we developed a little extensible indexing framework, and for HTML indexing we used 2 tools to handle HTML parsing: JTidy and NekoHTML. Since the code for the book is freely available... http://www.manning.com. NekoHTML knows how to deal with some bad HTML, that's why I'm suggesting this. The indexing framework could come in handy for those working on various 'desktop search' applications (Roosster, LDesktop (if that's really happening), Lucidity, etc.) Otis

--- Hetan Shah [EMAIL PROTECTED] wrote:

java org.apache.lucene.demo.IndexHTML -create -index /source/workarea/hs152827/newIndex ..
adding ../0/10037.html
adding ../0/10050.html
adding ../0/1006132.html
adding ../0/1013223.html
Parse Aborted: Encountered "\" at line 5, column 1. Was expecting one of: <ArgName> ... "=" ... <TagEnd> ...

And then the indexing hangs on this line. Earlier it used to go on and index the remaining pages in the directory. Any idea why the indexer would stop at this error? Pointers are much needed and appreciated. -H

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
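For the curious, here is a rough sketch of the JTidy route (NekoHTML works along similar lines). The recursive text-collecting walk is hand-rolled here for illustration and is not part of the book's framework; the extracted text would then be handed to a Lucene Field.

import java.io.FileInputStream;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.tidy.Tidy;

public class HtmlTextExtractor {
  public static void main(String[] args) throws Exception {
    Tidy tidy = new Tidy();
    tidy.setQuiet(true);
    tidy.setShowWarnings(false); // tolerate sloppy HTML quietly
    org.w3c.dom.Document dom = tidy.parseDOM(new FileInputStream(args[0]), null);
    StringBuffer text = new StringBuffer();
    collectText(dom, text);
    System.out.println(text.toString()); // feed this into a Lucene Field
  }

  // Recursively gather all text nodes under the given DOM node.
  private static void collectText(Node node, StringBuffer buf) {
    if (node.getNodeType() == Node.TEXT_NODE) {
      buf.append(node.getNodeValue()).append(' ');
    }
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      collectText(children.item(i), buf);
    }
  }
}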
Re: Finding unused segment files?
Hello George, Here is a quick hack (with a few TODOs). I only tested it a bit, so the actual delete calls are still commented out. If this works for you, and especially if you take care of the TODOs, I may put this in the Lucene Sandbox. Otis

P.S. Usage example showing how the tool found some unused segments (this was caused by a bug in one of the earlier 1.4 versions of Lucene):

[EMAIL PROTECTED] java]$ java org.apache.lucene.index.SegmentPurger /simpy/users/1/index
Candidate non-Lucene file found: _1b2.del
Candidate unused Lucene file found: _1b2.cfs
Candidate unused Lucene file found: _1bm.cfs
Candidate unused Lucene file found: _1c6.cfs
Candidate unused Lucene file found: _1cq.cfs
Candidate unused Lucene file found: _1da.cfs
Candidate unused Lucene file found: _1du.cfs
Candidate unused Lucene file found: _1ee.cfs
Candidate unused Lucene file found: _1ey.cfs
[EMAIL PROTECTED] java]$
[EMAIL PROTECTED] java]$ strings /simpy/users/1/index/segments
_3o0
[EMAIL PROTECTED] java]$ ls -al /simpy/users/1/index/
total 647
drwxrwsr-x    2 otis     simpy        1024 Dec  7 14:39 .
drwxrwsr-x    3 otis     simpy        1024 Sep 16 20:39 ..
-rw-rw-r--    1 otis     simpy      212815 Nov 17 18:36 _1b2.cfs
-rw-rw-r--    1 otis     simpy         104 Nov 17 18:40 _1b2.del
-rw-rw-r--    1 otis     simpy        3380 Nov 17 18:40 _1bm.cfs
-rw-rw-r--    1 otis     simpy        3533 Nov 17 18:40 _1c6.cfs
-rw-rw-r--    1 otis     simpy        4774 Nov 17 18:40 _1cq.cfs
-rw-rw-r--    1 otis     simpy        3389 Nov 17 18:40 _1da.cfs
-rw-rw-r--    1 otis     simpy        3809 Nov 17 18:40 _1du.cfs
-rw-rw-r--    1 otis     simpy        3423 Nov 17 18:40 _1ee.cfs
-rw-rw-r--    1 otis     simpy        4016 Nov 17 18:40 _1ey.cfs
-rw-rw-r--    1 otis     simpy      410299 Dec  7 14:39 _3o0.cfs
-rw-rw-r--    1 otis     simpy           4 Dec  7 14:39 deletable
-rw-rw-r--    1 otis     simpy          29 Dec  7 14:39 segments

--- [EMAIL PROTECTED] wrote: Hello all. I recently ran into a problem where errors during indexing or optimization (perhaps related to running out of disk space) left me with a working index in a directory, but with additional (partial) segment files that were unneeded. The solution for finding the ~40 files to keep out of the ~900 files in the directory amounted to dumping the segments file and noting that only 5 segments were in fact live. The index is a non-compound index using FSDirectory. Is there (or would it be possible to add (and I'd be willing to submit code if it made sense to people)) some sort of interrogation on the index of what files belong to it? I looked first at FSDirectory itself, thinking that its list() method should return the subset of index-related files, but looking deeper it looks like Directory is at a lower level, abstracting simple I/O, and thus wouldn't know. So any thoughts? Would it make sense to have a form of clean on IndexWriter()? I hesitate since it seems there isn't a charter that only Lucene files could exist in the directory, thus what is ideal for my application (since I know I won't mingle other files) might not be ideal for all. Would it be fair to look for known Lucene extensions and file naming signatures to identify unused files that might be failed or dead segments? Thanks, -George

package org.apache.lucene.index;

import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.FSDirectory;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Iterator;
import java.io.File;

/**
 * A tool that peeks into Lucene index directories and removes
 * unwanted files. In its more radical mode, this tool can be used to
 * remove all non-Lucene index files from a directory.
 * The other option is to remove unused Lucene segment files, should the index
 * directory get polluted.
 *
 * TODO: this tool should really lock the directory for writing before
 * removing any Lucene segment files, otherwise this tool itself may
 * corrupt the index.
 *
 * @author Otis Gospodnetic
 * @version $Id$
 */
public class SegmentPurger {

  // TODO: copied from SegmentMerger - should probably made public
  // static final, to make it reusable
  // TODO: add .del extension

  // File extensions of old-style index files
  public static final String MULTIFILE_EXTENSIONS[] = new String[] {
    "fnm", "frq", "prx", "fdx", "fdt", "tii", "tis"
  };
  public static final String VECTOR_EXTENSIONS[] = new String[] {
    "tvx", "tvd", "tvf"
  };
  public static final String COMPOUNDFILE_EXTENSIONS[] = new String[] {
    "cfs"
  };
  public static final String INDEX_FILES[] = new String[] {
    "segments", "deletable"
  };
  public static final String[][] SEGMENT_EXTENSIONS = new String[][] {
    MULTIFILE_EXTENSIONS, COMPOUNDFILE_EXTENSIONS, VECTOR_EXTENSIONS
  };

  /** The file format version, a negative number. */
  /* Works since counter, the old
RE: OutOfMemoryError with Lucene 1.4 final
Ying, You should follow this finally block advice below. In addition, I think you can just close the reader, and it will close the underlying stream (I'm not sure about that, double-check it). You are not running out of file handles, though. Your JVM is running out of memory. You can play with: 1) -Xms and -Xmx JVM command-line parameters 2) IndexWriter's parameters: mergeFactor and minMergeDocs - check the Javadocs for more info. They will let you control how much memory your indexing process uses. Otis

--- Sildy Augustine [EMAIL PROTECTED] wrote: I think you should close your files in a finally clause in case of exceptions with the file system, and also print out the exception. You could be running out of file handles.

-Original Message- From: Jin, Ying [mailto:[EMAIL PROTECTED] Sent: Friday, December 10, 2004 11:15 AM To: [EMAIL PROTECTED] Subject: OutOfMemoryError with Lucene 1.4 final

Hi, Everyone, We're trying to index ~1500 archives but get OutOfMemoryError about halfway through the index process. I've tried to run the program under two different Redhat Linux servers: one with 256M memory and 365M swap space, the other one with 512M memory and 1G swap space. However, both got OutOfMemoryError at the same place (at record 898). Here is my code for indexing:
===
Document doc = new Document();
doc.add(Field.UnIndexed("path", f.getPath()));
doc.add(Field.Keyword("modified",
    DateField.timeToString(f.lastModified())));
doc.add(Field.UnIndexed("eprintid", id));
doc.add(Field.Text("metadata", metadata));

FileInputStream is = new FileInputStream(f); // the text file
BufferedReader reader = new BufferedReader(new InputStreamReader(is));
StringBuffer stringBuffer = new StringBuffer();
String line = "";
try {
  while ((line = reader.readLine()) != null) {
    stringBuffer.append(line);
  }
  doc.add(Field.Text("contents", stringBuffer.toString()));
  // release the resources
  is.close();
  reader.close();
} catch (java.io.IOException e) {}
=
Is there anything wrong with my code or do I need more memory? Thanks for any help! Ying

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
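Putting both pieces of advice together, a sketch of the indexing loop might look like this: the file's Reader is handed straight to Field.Text so the whole file is never buffered into a StringBuffer, and the Reader is closed in a finally block. The method signature here is made up to keep the example self-contained.

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class FileIndexer {
  // One file per Document; Lucene consumes the Reader inside addDocument(),
  // so it is safe (and necessary) to close it afterwards in finally.
  static void indexFile(IndexWriter writer, File f, String id, String metadata)
      throws IOException {
    BufferedReader reader = new BufferedReader(
        new InputStreamReader(new FileInputStream(f)));
    try {
      Document doc = new Document();
      doc.add(Field.UnIndexed("path", f.getPath()));
      doc.add(Field.UnIndexed("eprintid", id));
      doc.add(Field.Text("metadata", metadata));
      // No StringBuffer: the file's text is streamed, not held in memory.
      doc.add(Field.Text("contents", reader));
      writer.addDocument(doc);
    } finally {
      try { reader.close(); } catch (IOException ignore) { /* NOOP */ }
    }
  }
}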
Re: maxDoc()
Hello Garrett, Share some code, it will be easier for others to help you that way. Obviously, this would be a huge bug if the problem were within Lucene. Otis --- Garrett Heaver [EMAIL PROTECTED] wrote: Can anyone please explain to me why maxDoc returns 0 when Luke shows 239,473 documents? maxDoc returns the correct number until I delete a document. And I have called optimize after the delete, but still the problem remains. Strange. Any ideas greatly appreciated Garrett - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: problem restoring index
There is no need to reindex. However, I also don't quite get what the problem is :) Otis --- Santosh [EMAIL PROTECTED] wrote: Hi, when I restart Tomcat, the index is getting corrupted. If I take a backup of the index and then restart Tomcat, the index is not working properly. Do I have to index all the documents again whenever I restart Tomcat? - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: searching with special characters
The leading wildcard character (*) is not allowed if you use the QueryParser that comes with Lucene. Reason: performance. See the many discussions about this on the lucene-user mailing list. Also see the search syntax document on the Lucene site. What other characters are you having trouble with? Otis --- Santosh [EMAIL PROTECTED] wrote: Whenever I search with some special characters like *world I am getting an exception. How can I avoid this? And for what other characters does Lucene throw this type of exception? - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
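For completeness: QueryParser rejects the leading wildcard, but a query built programmatically can still use one. A minimal sketch (index path and field name invented), with the obvious caveat that this is exactly the expensive operation QueryParser guards against, since a leading wildcard forces Lucene to enumerate every term in the field:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.WildcardQuery;

public class LeadingWildcardDemo {
  public static void main(String[] args) throws Exception {
    IndexSearcher searcher = new IndexSearcher("/path/to/index"); // hypothetical path
    // QueryParser would reject "*world", but building the query by hand works.
    Query query = new WildcardQuery(new Term("contents", "*world"));
    Hits hits = searcher.search(query);
    System.out.println(hits.length() + " matching documents");
    searcher.close();
  }
}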
Re: Empty/non-empty field indexing question
Correct. No, there is no point in putting an empty field there. Otis --- [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Hi Otis What kind of implications does that produce on the search? If I understand correctly, that record would not be searched for if the field is not there, correct? But then is there a point in putting an empty value in it, if an application will never search for empty values? thanks -pedja Otis Gospodnetic said the following on 12/8/2004 1:31 AM: Empty fields won't add any value, you can skip them. Documents in an index don't have to be uniform. Each Document could have a different set of fields. Of course, that has some obvious implications for search, but is perfectly fine technically. Otis --- [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Here's probably a silly question, very newbish, but I had to ask. Since I have mysql documents that contain over 30 fields each and most of them are added to the index, is it a common practice to add fields to the index with empty values for that particular record, or should the field be totally omitted? What I mean is, if let's say a Title field is empty on a specific record (in mysql), should I still add that field into the Lucene index with an empty value, or just skip it and only add the fields that contain non-empty values? thanks -pedja - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: 'IN' type search
Hello, You can use BooleanQuery for that. Otis --- Ravi [EMAIL PROTECTED] wrote: Hi How do you get all documents in Lucene where a particular field value is in a given list of values (like SQL IN)? What kind of Query class should I use? Thanks in advance. Ravi. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
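A minimal sketch of the BooleanQuery approach, using the Lucene 1.4-era add(query, required, prohibited) signature; the field name and values are invented:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class InQueryDemo {
  // Lucene equivalent of: WHERE field IN (values[0], values[1], ...)
  public static Query makeInQuery(String field, String[] values) {
    BooleanQuery query = new BooleanQuery();
    for (int i = 0; i < values.length; i++) {
      // required=false, prohibited=false: each clause is an OR ("should")
      // clause, so a document matches if the field holds any of the values.
      query.add(new TermQuery(new Term(field, values[i])), false, false);
    }
    return query;
  }

  public static void main(String[] args) {
    Query q = makeInQuery("state", new String[] { "NY", "NJ", "CT" });
    System.out.println(q.toString("state")); // state:NY state:NJ state:CT
  }
}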
RE: When is the book released?
Hello, Yes, Lucene in Action has been listed on Amazon for a while now (I think I recorded this in my blog some time back). The publish date is, I believe, the date provided by publishers, but things almost always take longer than predicted (just like lots of software), so 31.12.2004 may be a bit off. :( However, the ebook should be out any time now, as Erik already mentioned. It's cheaper, saves trees, and doesn't consume precious horizontal surfaces in your home (I live in New York City, where large living spaces are hard to find unless you live in a former warehouse or pay big money). Otis

--- Palmer, Andrew MMI Woking [EMAIL PROTECTED] wrote: I have just had a quick look at both the US and UK versions of Amazon and they both list the book as Lucene In Action. I was curious, as I work for the UK Bibliographic agency and it was on our database and should have been on Amazon for at least a couple of weeks. The agency has known about the book since the start of September. It has a publication date of 31/12/2004. Andrew

-Original Message- From: Erik Hatcher [mailto:[EMAIL PROTECTED] Sent: 07 December 2004 12:45 To: Lucene Users List Subject: Re: When is the book released?

On Dec 7, 2004, at 5:27 AM, Aad Nales wrote: Sorry if this is a mispost, but I have been visiting Amazon daily the last few weeks and I still can't get the Lucene book there. How will I survive the holidays? :-) But seriously, when can we expect the release?

Manning will have the electronic book version available *TODAY* (hopefully). It has been sent to the printers and this process takes a few weeks. I don't expect Amazon.com to be shipping them until January though - the book industry really is slow moving. Otis and I thank everyone for their patience, and you can be sure that no one wants the book in their hands more than he and I :) Erik

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: QueryFilter vs CachingWrapperFilter vs RangeQuery
If you run the same query again, the IndexSearcher will go all the way to the index again - no caching. Some caching will be done by your file system, possibly, but that's it. Lucene is fast, so don't optimize early. Otis

--- Ben Rooney [EMAIL PROTECTED] wrote: Thanks Chris, you are correct that I'm not sure if I need the caching ability. It is more to understand right now, so that if we do need to implement it, I am able to. The reason for the caching is that we will have listing pages for certain content types, for example a listing page of articles. This listing will be generated against the Lucene engine using a basic query. The page will also have the ability to filter the articles based on, as one example, a date range. So caching those results could be beneficial. However, we will also potentially want to cache the basic query so that subsequent queries will hit a cache. When new content is published or content is removed from the site, the caches will need to be invalidated so new results are created. For the basic query, is there any caching mechanism built into the SearchIndexer or do we need to build our own caching mechanism? thanks ben

On Tue, 2004-07-12 at 12:29 -0800, Chris Hostetter wrote: : executes the search, i would keep a static reference to SearchIndexer : and then when i want to invalidate the cache, set it to null or create : design of your system. But, yes, you do need to keep a reference to it : for the cache to work properly. If you use a new IndexSearcher : instance (I'm simplifying here, you could have an IndexReader instance : yourself too, but I'm ignoring that possibility) then the filtering : process occurs for each search rather than using the cache.

Assuming you have a finite number of Filters, and assuming those Filters are expensive enough to be worth it... Another approach you can take to share the cache among multiple IndexReaders is to explicitly call the bits method on your filter(s) once, and then cache the resulting BitSet anywhere you want (ie: serialize it to disk if you so choose), and then implement a BitsFilter class that you can construct directly from a BitSet, regardless of the IndexReader. The down side of this approach is that it will *ONLY* work if you are certain that the index is never being modified. If any documents get added, or the index gets re-optimized, you must regenerate all of the BitSets. (That's why the CachingWrapperFilter's cache is keyed off of the IndexReader ... as long as you're re-using the same IndexReader, it knows that the cached BitSet must still be valid, because an IndexReader always sees the same index as when it was opened, even if another thread/process modifies it.)

class BitsFilter extends Filter {
  BitSet bits;
  public BitsFilter(BitSet bits) {
    this.bits = bits;
  }
  public BitSet bits(IndexReader r) {
    return (BitSet) bits.clone();
  }
}

-Hoss

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
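To illustrate Hoss's point about the cache being keyed off the IndexReader, here is a minimal sketch of the filter-caching pattern: one long-lived IndexSearcher and one long-lived filter instance, reused across searches. The index path, field names, and date values are invented; wrapping the QueryFilter in a CachingWrapperFilter is shown for explicitness, on the assumption that the filter and searcher outlive individual requests.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.CachingWrapperFilter;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryFilter;
import org.apache.lucene.search.RangeQuery;
import org.apache.lucene.search.TermQuery;

public class CachedFilterDemo {
  public static void main(String[] args) throws Exception {
    // Both the searcher and the filter must be shared, long-lived instances;
    // the cached BitSet is keyed off the IndexReader inside the searcher.
    IndexSearcher searcher = new IndexSearcher("/path/to/index"); // hypothetical
    Query dateRange = new RangeQuery(
        new Term("pubdate", "20041201"), new Term("pubdate", "20041231"), true);
    Filter filter = new CachingWrapperFilter(new QueryFilter(dateRange));

    Query articles = new TermQuery(new Term("type", "article"));
    Hits first = searcher.search(articles, filter);   // computes and caches the BitSet
    Hits second = searcher.search(articles, filter);  // reuses the cached BitSet
    System.out.println(first.length() + " / " + second.length());
  }
}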
Re: Empty/non-empty field indexing question
Empty fields won't add any value, you can skip them. Documents in an index don't have to be uniform. Each Document could have a different set of fields. Of course, that has some obvious implications for search, but is perfectly fine technically. Otis --- [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Here's probably a silly question, very newbish, but I had to ask. Since I have mysql documents that contain over 30 fields each and most of them are added to the index, is it a common practice to add fields to the index with empty values for that particular record, or should the field be totally omitted? What I mean is, if let's say a Title field is empty on a specific record (in mysql), should I still add that field into the Lucene index with an empty value, or just skip it and only add the fields that contain non-empty values? thanks -pedja - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
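In code, "skip the empty fields" is just a guard before each add; a tiny helper along these lines (the method name is made up) keeps the per-record indexing loop tidy:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class OptionalFields {
  // Adds the field only when a non-empty value exists; Documents in the
  // same index may legitimately end up with different sets of fields.
  static void addIfPresent(Document doc, String name, String value) {
    if (value != null && value.length() > 0) {
      doc.add(Field.Text(name, value));
    }
  }
}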
Re: addIndexes() Size
If I were you, I would first use Luke to peek at the index. You may find something obvious there, like multiple copies of the same Document. Does your temp index 'overlap' with the A index in terms of Documents? If so, you will end up with multiple copies, as the addIndexes method doesn't detect and remove duplicate Documents. Otis

--- Garrett Heaver [EMAIL PROTECTED] wrote: Hi. It's probably really simple to explain this, but since I'm not up to speed on the way Lucene stores the data, I'm a little confused. I'm building an Index, which resides on Server A, with the Lucene Service running on Server B. Now, not to bore you with the details, but because of the network transfer rate etc I'm running the actual index on \\ServerA\idx and building a temp Index at \\ServerB\idx\temp (obviously because the local FS is much faster for the service) and then calling addIndexes to import the temp index to the ServerA index before destroying the ServerB index, holding for a bit and then checking for new documents. All works grand, BUT the size of the resultant index on ServerA is HUGE in comparison to one I'd build from start to finish (i.e. a simple addDocument Index) - 38gig for 220,000 Unstored Items cannot be right (to give you an idea of how mad this seems, the backed up version of the database from which the data is pulled is only 2gigs). I've considered it being perhaps the number of Items that had to be integrated each time addIndexes was called - right now I'm adding around 10,000 at a time (I had done 1000 at a time but this looked like it was going to end up even larger still). I'm holding off twiddling the minMergeDocs and mergeFactor until I can get a better understanding of what's going on here. Many thanks for any replies Garrett

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Index delete failing
This smells like a Windows issue. It is possible that something in your JVM is still holding onto the index directory (for example, FSDirectory), and Winblows is not letting you remove the directory. I bet this will work if you exit the JVM and run java.io.File.delete() without calling Lucene. Sorry, my Windows + Lucene experience is limited. Otis --- Ravi [EMAIL PROTECTED] wrote: Hi We need to delete a Lucene index from our application using java.io.File.delete(). We are closing the IndexWriter and even all the index searchers on that folder. But a call to delete returns false. There is no lock on the index directory. The interesting thing is that the deletable and segments files are getting removed, but the rest of the .cfs files are not. Has somebody had a similar problem? Thanks in advance, Ravi. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Single Digit Indexing
Hm, if you can index 11, you should be able to index 8 as well. In any case, you most likely want to make sure that your Analyzer is not just throwing your numbers out. This may still be up to date: http://www.jguru.com/faq/view.jsp?EID=538308 See also: http://wiki.apache.org/jakarta-lucene/HowTo Otis --- Bill von Ofenheim (LaRC) [EMAIL PROTECTED] wrote: How can I get Lucene to index single digits (e.g. 8 as in Gemini 8)? I am able to index numbers with two or more digits (e.g. 11 as in Apollo 11). Thanks, Bill von Ofenheim - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
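A quick way to check what an Analyzer does with digits is to tokenize a sample string and print the tokens. A small sketch: SimpleAnalyzer's letter-based tokenizer should drop the "8", while WhitespaceAnalyzer and StandardAnalyzer should keep it (worth verifying against your own Lucene version).

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class DigitTokenDemo {
  public static void main(String[] args) throws Exception {
    show(new SimpleAnalyzer());     // letters only: "8" is thrown out
    show(new WhitespaceAnalyzer()); // keeps "8"
    show(new StandardAnalyzer());   // keeps "8"
  }

  static void show(Analyzer analyzer) throws Exception {
    TokenStream ts = analyzer.tokenStream("field", new StringReader("Gemini 8"));
    System.out.print(analyzer.getClass().getName() + ":");
    for (Token t = ts.next(); t != null; t = ts.next()) {
      System.out.print(" [" + t.termText() + "]");
    }
    System.out.println();
  }
}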
Re: Is this a bug or a feature with addIndexes?
Hello, Try changing IndexWriter's mergeFactor variable. It's 10 by default. Change it to 1, for instance. Otis --- [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Greetings, Ok, so maybe this is common knowledge to most of you, but I'm a layman when it comes to Lucene and I couldn't find any details about this after some searching. When you merge two indexes via addIndexes, does it only work in batches (10 or more documents)? Because I've been banging my head off the wall wondering why my code does not want to index 1 (one) document, and then I went to run Otis's MemoryVsDisk class from http://www.onjava.com/pub/a/onjava/2003/03/05/lucene.html?page=last but I didn't use 10,000 documents as suggested, I used 5 and 15 instead. And what do you know, with less than 10 it doesn't merge at all, while with more than 10 it will merge only the first 10 documents and gently forget about the other 5. My project requires me to index/update one single document as required and make it immediately available for searching. How do I accomplish this if index merging will not merge less than 10 and in increments of 10, and single indexing doesn't seem to do it at all (please see my other post http://marc.theaimsgroup.com/?l=lucene-user&m=110237364203877&w=2) thanks -pedja - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
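Beyond the mergeFactor tweak, it may help to see that a document added with addDocument() becomes searchable as soon as the writer is flushed (closed) and a *new* searcher is opened, whatever the merge settings are; the batching behaviour in the article comes from the RAMDirectory-plus-addIndexes pattern, not from addDocument itself. A small sketch (index path and field values invented):

import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class AddOneDocDemo {
  public static void main(String[] args) throws Exception {
    String index = "/tmp/one-doc-index";
    IndexWriter writer = new IndexWriter(index, new SimpleAnalyzer(), true); // create
    Document doc = new Document();
    doc.add(Field.Keyword("id", "doc-42")); // hypothetical id
    writer.addDocument(doc);
    writer.close(); // flush: the single document is now on disk

    // A fresh IndexSearcher is required; an already-open one keeps seeing
    // the index as it was when it was opened.
    IndexSearcher searcher = new IndexSearcher(index);
    Hits hits = searcher.search(new TermQuery(new Term("id", "doc-42")));
    System.out.println("found: " + hits.length()); // prints: found: 1
    searcher.close();
  }
}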
Re: restricting search result
This is entirely application-specific. As the simplest approach, you can index each user's documents in a separate index and use (Parallel)MultiSearcher to search the appropriate indices (which ones are appropriate to search has to be a part of your app's access control logic). Otis --- Paul [EMAIL PROTECTED] wrote: Hi, how would you restrict the search results for a certain user? I'm indexing all the existing data in my application, but there are certain access levels, so some users should see more results than others. Each Lucene document has a field with an internal id and I want to restrict on that basis. I tried adding a long concatenation of my ids (+locationId:1 +locationId:3 + ...) but this throws a "More than 32 required/prohibited clauses in query." exception. Any suggestions? thx! Paul - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
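A minimal sketch of the per-user-index approach: the access-control layer decides which index directories a user may see, and a MultiSearcher fans the query out over just those. The paths and field name are invented for illustration.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Searchable;
import org.apache.lucene.search.TermQuery;

public class PerUserSearch {
  public static void main(String[] args) throws Exception {
    // Suppose the app's access-control logic decided this user may see
    // locations 1 and 3; each location has its own index directory.
    Searchable[] allowed = {
      new IndexSearcher("/indexes/location-1"),
      new IndexSearcher("/indexes/location-3")
    };
    MultiSearcher searcher = new MultiSearcher(allowed);
    Hits hits = searcher.search(new TermQuery(new Term("contents", "lucene")));
    System.out.println(hits.length() + " visible documents");
    searcher.close(); // closes the underlying searchers too
  }
}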
Re: Recommended values for mergeFactor, minMergeDocs, maxMergeDocs
In my experiments with mergeFactor I found the point of diminishing/no returns. If I remember correctly, I hit the limit at a mergeFactor of 50. But here is something from Lucene in Action that you can use to play with various index tuning factors and see their effect on indexing performance. It's simple, and if you want to test all 3 of your scenarios, you will have to modify it.

package lia.indexing;

import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

/**
 *
 */
public class IndexTuningDemo {

  public static void main(String[] args) throws Exception {
    int docsInIndex = Integer.parseInt(args[0]);

    // create an index called 'index-dir' in a temp directory
    Directory dir = FSDirectory.getDirectory(
      System.getProperty("java.io.tmpdir", "tmp") +
      System.getProperty("file.separator") + "index-dir", true);
    Analyzer analyzer = new SimpleAnalyzer();
    IndexWriter writer = new IndexWriter(dir, analyzer, true);

    // set variables that affect speed of indexing
    writer.mergeFactor = Integer.parseInt(args[1]);
    writer.maxMergeDocs = Integer.parseInt(args[2]);
    writer.minMergeDocs = Integer.parseInt(args[3]);
    writer.infoStream = System.out;

    System.out.println("Merge factor:   " + writer.mergeFactor);
    System.out.println("Max merge docs: " + writer.maxMergeDocs);
    System.out.println("Min merge docs: " + writer.minMergeDocs);

    long start = System.currentTimeMillis();
    for (int i = 0; i < docsInIndex; i++) {
      Document doc = new Document();
      doc.add(Field.Text("fieldname", "Bibamus"));
      writer.addDocument(doc);
    }
    writer.close();
    long stop = System.currentTimeMillis();
    System.out.println("Time: " + (stop - start) + " ms");
  }
}

Otis

--- Chuck Williams [EMAIL PROTECTED] wrote: I'm wondering what values of mergeFactor, minMergeDocs and maxMergeDocs people have found to yield the best performance for different configurations. Is there a repository of this information anywhere? I've got about 30k documents and have 3 indexing scenarios:

1. Full indexing and optimize
2. Incremental indexing and optimize
3. Parallel incremental indexing without optimize

Search performance is critical. For both cases 1 and 2, I'd like the fastest possible indexing time. For case 3, I'd like minimal pauses and no noticeable degradation in search performance. Based on reading the code (including the javadoc comments), I'm thinking of values along these lines:

mergeFactor: 1000 during full indexing, and during optimize (for both cases 1 and 2); 10 during incremental indexing (cases 2 and 3)
minMergeDocs: 1000 during full indexing, 10 during incremental indexing
maxMergeDocs: Integer.MAX_VALUE during full indexing, 1000 during incremental indexing

Do these values seem reasonable? Are there better settings before I start experimenting? Since mergeFactor is used in both addDocument() and optimize(), I'm thinking of using two different values in case 2: 10 during the incremental indexing, and then 1000 during the optimize. Is changing the value like this going to cause a problem? Thanks for any advice, Chuck

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Document -> Map, Hits -> List
Yes, it's not wise to just pull all Document instances from a Hits instance, unless you really need them all. I don't do that, I really just provide a wrapper, like this:

/**
 * A simple List implementation wrapping a Hits object.
 *
 * @author Otis Gospodnetic
 * @version $Id: HitList.java,v 1.4 2004/11/11 14:08:33 otis Exp $
 */
public class HitList extends AbstractList {
  private Hits _hits;

  /**
   * Creates a new <code>HitList</code> instance.
   *
   * @param hits <code>Hits</code> to wrap
   */
  public HitList(Hits hits) {
    _hits = hits;
  }

  /**
   * @see java.util.List#get(int)
   */
  public Object get(int index) {
    try {
      return _hits.doc(index); // Documents are fetched lazily, one at a time
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  /**
   * @see java.util.List#size()
   */
  public int size() {
    return _hits.length();
  }
  ...
  ...

Otis

--- Luke Francl [EMAIL PROTECTED] wrote: On Wed, 2004-12-01 at 10:27, Otis Gospodnetic wrote: This is very similar to what I do - I create a List of Maps from Hits and its Documents. So I think this change may be handy, if doable (I didn't look into changing the two Lucene classes, actually). How do you avoid the problem Eric just mentioned, iterating through all the Hits at once to populate this data structure? I do a similar thing, creating a List of asset references from a field in each Lucene Document in my Hits list (actual data for display retrieved from a separate datastore). I was not aware of any performance problems from doing this, but now I am wondering about the implications. Thanks, Luke

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: IndexWriter.optimize and memory usage
Hello and quick answers: See the IndexWriter javadoc, and in particular mergeFactor, minMergeDocs, and maxMergeDocs. This will let you control the size of your segments, the frequency of segment merges, the amount of buffered Documents in RAM between segment merges, and such. Also, you ask about calling optimize periodically - no need, Lucene should already merge segments once in a while for you. Optimize at the end. You can also experiment with different JVM args for various GC algorithms. Otis

--- Chris Hostetter [EMAIL PROTECTED] wrote: I've been running into an interesting situation that I wanted to ask about. I've been doing some testing by building up indexes with code that looks like this...

IndexWriter writer = null;
try {
  writer = new IndexWriter(index, new StandardAnalyzer(), true);
  writer.mergeFactor = MERGE_FACTOR;
  PooledExecutor queue = new PooledExecutor(NUM_UPDATE_THREADS);
  queue.waitWhenBlocked();
  for (int min = low; min < high; min += BATCH_SIZE) {
    int max = min + BATCH_SIZE;
    if (high < max) {
      max = high;
    }
    queue.execute(new BatchIndexer(writer, min, max));
  }
  end = new Date();
  System.out.println("Build Time: " + (end.getTime() - start.getTime()) + "ms");
  start = end;
  writer.optimize();
} finally {
  if (null != writer) {
    try { writer.close(); } catch (Exception ignore) { /*NOOP*/; }
  }
}
end = new Date();
System.out.println("Optimize Time: " + (end.getTime() - start.getTime()) + "ms");

(where BatchIndexer is a class I have that gets a DB connection, slurps all records from my DB between min and max, builds some simple Documents out of them, and calls writer.addDocument(doc) on each) This was working fine with small ranges, but then I tried building up a nice big index for doing some performance testing. I left it running overnight, and when I came back in the morning I discovered that after successfully building up the whole index (~112K docs, ~1.5GB disk) it crashed with an OutOfMemory exception while trying to optimize. I then realized I was only running my JVM with a 256m upper limit on RAM, and I figured that PooledExecutor was still in scope, and maybe it was maintaining some state that was using up a lot of space, so I whipped up a quick little app to solve my problem...

public static void main(String[] args) throws Exception {
  IndexWriter writer = null;
  try {
    writer = new IndexWriter(index, new StandardAnalyzer(), false);
    writer.optimize();
  } finally {
    if (null != writer) {
      try { writer.close(); } catch (Exception ignore) { /*NOOP*/; }
    }
  }
}

...but I was disappointed to discover that even this couldn't run with only 256m of RAM. I bumped it up to 512m and then it managed to complete successfully (the final index was only 1.1GB of disk). This raises a few questions in my mind: 1) Is there a rule of thumb for knowing how much memory it takes to optimize an index? 2) Is there a Best Practice to follow when building up a large index from scratch in order to reduce the amount of memory needed to optimize once the whole index is built? (ie: would spinning up a thread that called writer.optimize() every N minutes be a good idea?)
3) Given an unoptimized index that's already been built (ie: in the case where my builder crashed and I wanted to try and optimize it without having to rebuild from scratch), is there any way to get IndexWriter to use less RAM and more disk (trading speed for a smaller form factor -- and apparently greater stability, so that the app doesn't crash)? I imagine that the answers to #1 and #2 are largely dependent on the nature of the data in the index (ie: the frequency of terms), but I'm wondering if there is a high level formula that could be used to say "based on the nature of your data, you want to take this approach to optimizing when you build" - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Document -> Map, Hits -> List
This is very similar to what I do - I create a List of Maps from Hits and its Documents. So I think this change may be handy, if doable (I didn't look into changing the two Lucene classes, actually). Otis --- petite_abeille [EMAIL PROTECTED] wrote: On Dec 01, 2004, at 13:37, Karthik N S wrote: We create an ArrayList Object and load all the Hit Values into them and return the same for Display purposes on a Servlet. Talking of which... It would be very handy if org.apache.lucene.search.Hits would implement the java.util.List interface... in addition, org.apache.lucene.document.Document could implement java.util.Map... That way, the rest of the application could pretend to simply have to deal with a List of Maps, without having to get exposed to any Lucene internals... Thoughts? Cheers, PA. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: What is the best file system for Lucene?
Hello, Lucene indexing completes in 13-15 hours on the desktop system while it completes in about 29-33 hours on the notebook. Now, combine it with the DROP INDEX tests completing in the same amount of time on both, and find out why the search is only slightly faster :) Until then, all your measurements are subjective and you don't gain much by comparing the two indexing processes. I'm worried about searching. Indexing is a lot faster on the desktop config. This tells you that your problem is not the disk itself, and not the filesystem. The bottleneck is elsewhere. Why not run your search under a profiler? That will tell you where the JVM is spending its time. It may even be in some weird InetAddress call, as another person already pointed out. Otis - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: similarity matrix - more clear
Hello, I don't think Lucene can spit out the similarity matrix for you, but perhaps you can use Lucene's Term Vector support to help you build the matrix yourself: http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/TermFreqVector.html The other relevant sections of the Lucene API to look at are: http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexReader.html#getTermFreqVectors(int) http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/document/Field.html#Text(java.lang.String,%20java.io.Reader,%20boolean) ... This should let you tell Lucene to compute and store term vectors during indexing, and then you will be able to retrieve a Term Vector for each Document in the index/collection. Armed with this data, you should be able to compute similarities between Documents with TV dot products/cosines, which should be enough for you to build your similarity matrix. This sounds like something that would be nice to have in the Lucene Sandbox, so if you end up with some code that you are allowed to share, please contribute it back to Lucene. Otis

--- Roxana Angheluta [EMAIL PROTECTED] wrote: Dear all, Yesterday I asked a question about getting the similarity matrix of a collection of documents from an index, but I got only one answer, so perhaps my question was not very clear. I will try to reformulate: I want to use Lucene to have efficient access to an index of a collection of documents. My final purpose is to cluster documents. Therefore I need to have, for each pair of documents, a number signifying the similarity between them. A possible solution would be to initialize in turn each document as a query, do a search using an IndexSearcher, and take from the search result the similarity between the query (which is in fact a document) and all the other documents. This is highly redundant, because the similarity between a pair of documents is computed multiple times. I was wondering whether there is a simpler way to do it, since the index file contains all the information needed. Can anyone help me here? Thanks, roxana PS I know about the project Carrot2, which deals with document clustering, but I think it is not appropriate for me because of 2 reasons: 1) I need to keep the index on the disk for further reuse 2) I need to be able to search efficiently in the index I thought Lucene can help me here, am I wrong?

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
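A sketch of the dot-product/cosine computation on top of term vectors, assuming the field was indexed with term vectors enabled (e.g. Field.Text("contents", text, true)). This uses raw term frequencies rather than tf-idf weights, and the field name and class name are invented:

import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;

public class DocSimilarity {

  // Cosine similarity between the term vectors of two documents.
  public static double cosine(IndexReader reader, int docA, int docB, String field)
      throws Exception {
    TermFreqVector a = reader.getTermFreqVector(docA, field);
    TermFreqVector b = reader.getTermFreqVector(docB, field);
    if (a == null || b == null) return 0.0; // no term vector stored

    // Map b's terms to their frequencies for O(1) lookups.
    String[] bTerms = b.getTerms();
    int[] bFreqs = b.getTermFrequencies();
    Map bMap = new HashMap();
    for (int j = 0; j < bTerms.length; j++) {
      bMap.put(bTerms[j], new Integer(bFreqs[j]));
    }

    String[] aTerms = a.getTerms();
    int[] aFreqs = a.getTermFrequencies();
    double dot = 0.0, normA = 0.0, normB = 0.0;
    for (int i = 0; i < aTerms.length; i++) {
      Integer bf = (Integer) bMap.get(aTerms[i]);
      if (bf != null) dot += (double) aFreqs[i] * bf.intValue();
      normA += (double) aFreqs[i] * aFreqs[i];
    }
    for (int j = 0; j < bFreqs.length; j++) {
      normB += (double) bFreqs[j] * bFreqs[j];
    }
    if (normA == 0.0 || normB == 0.0) return 0.0;
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }
}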
Re: Does QueryParser uses Analyzer ?
QueryParser does use Analyzer, see this:

static public Query parse(String query, String field, Analyzer analyzer)
    throws ParseException {
  QueryParser parser = new QueryParser(field, analyzer);
  return parser.parse(query);
}

Otis P.S. Use the lucene-user list, please.

--- Ricardo Lopes [EMAIL PROTECTED] wrote: Does the QueryParser class really use the Analyzer passed to the parse method? I looked at the code and I don't see the object being used anywhere in the class. The problem is that I am writing an application with Lucene that searches in a foreign language with Latin characters. The indexing works fine, but the search apparently doesn't call the Analyzer. Here is an example: I have a file that contains the following word: memória. If I search for: memoria (without the accent character on the o) it finds the word, which is correct. If I search for: memória (the exact same word) it doesn't find the word, because the QueryParser splits the word into mem ria, but if the analyzer were called, the ó would be replaced by o. I guess the analyzer isn't called, is this right? Thanks in advance, Ricardo Lopes

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
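Since QueryParser runs every query term through the supplied Analyzer, one way to make "memória" and "memoria" match is an accent-normalizing TokenFilter used in the same Analyzer at both index and query time. This is only a suggested sketch, not something shipped with Lucene 1.4: the character table is deliberately incomplete, and it only helps if the tokenizer keeps the accented word in one token (StandardTokenizer should, given correctly decoded input).

import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// A tiny accent-folding filter: maps a few Latin-1 accented characters
// to their ASCII bases so "memória" and "memoria" become the same term.
public class AccentFilter extends TokenFilter {
  public AccentFilter(TokenStream in) { super(in); }

  public Token next() throws IOException {
    Token t = input.next();
    if (t == null) return null;
    String s = t.termText();
    StringBuffer sb = new StringBuffer(s.length());
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      switch (c) {
        case 'á': case 'à': case 'â': case 'ã': sb.append('a'); break;
        case 'é': case 'ê': sb.append('e'); break;
        case 'í': sb.append('i'); break;
        case 'ó': case 'ô': case 'õ': sb.append('o'); break;
        case 'ú': case 'ü': sb.append('u'); break;
        case 'ç': sb.append('c'); break;
        default: sb.append(c);
      }
    }
    // Preserve the original offsets, only the term text changes.
    return new Token(sb.toString(), t.startOffset(), t.endOffset());
  }
}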