bookkeeping documents cause problem in Sort
I understand that, unlike a relational database, Lucene is flexible about documents having different sets of fields. My index has documents with a date field and a content field. There are also a few bookkeeping documents that do not have the date field. Things work well except in one case:

    Sort sort = new Sort("date");
    searcher.search(query, sort);

In this case an exception is thrown:

    java.lang.RuntimeException: field date does not appear to be indexed

It does not make sense to sort by 'date' when a document does not have a 'date' field. On the other hand, I don't expect search() to return any bookkeeping documents at all, since the current query looks for fields that are not in those documents. Is this an implementation issue, or is there an inherent reason why all documents need to have the 'date' field if it is used for sorting?

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Lucene Unicode Usage
So you have a UTF-8 encoded text file. But how do you read the file into Java? Java's default encoding is likely to be something other than UTF-8. Make sure you specify the encoding, like:

    new InputStreamReader(new FileInputStream(filename), "UTF-8");

On Wed, 9 Feb 2005 22:32:38 -0700, Owen Densmore [EMAIL PROTECTED] wrote:

I'm building an index from a FileMaker database by dumping the data to a tab-separated file. Because the FileMaker output is encoded in MacRoman, and uses Mac line separators, I run a script across the tab file to clean it up:

    tr '\r\v' '\n ' | iconv -f MAC -t UTF-8

This basically converts the Mac \r's to \n's, replaces FileMaker's vtabs (for inter-field CRs) with blanks, and runs a character converter to build UTF-8 data for Java to use. Looks fine in jEdit and BBEdit, both of which understand UTF.

BUT -- when I look at the indexes created in Lucene using Luke, I get unprintable letters! Writing programs to dump the terms (using Writer subclasses which handle Unicode correctly) shows that indeed the files now have odd characters when viewed w/ jEdit and BBEdit.

The analyzer used to build the index looks like:

    public class RedfishAnalyser extends Analyzer {
        String[] stopwords;

        public RedfishAnalyser(String[] stopwords) {
            this.stopwords = stopwords;
        }

        public RedfishAnalyser() {
            this.stopwords = StopAnalyzer.ENGLISH_STOP_WORDS;
        }

        public TokenStream tokenStream(String fieldName, Reader reader) {
            return new PorterStemFilter(
                new StopFilter(
                    new LowerCaseFilter(
                        new StandardFilter(
                            new StandardTokenizer(reader))),
                    stopwords));
        }
    }

Yikes, what am I doing wrong?! Is the analyzer at fault? It's about the only place where I can see a problem happening.

Thanks for any pointers, Owen
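A minimal sketch of the advice in this reply, assuming the tab file really is UTF-8 after the iconv step. The class and method names (`Utf8FileReader`, `open`, `readAll`) are made up for illustration and are not part of Lucene; the point is only that the charset is named explicitly instead of relying on the platform default.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Lucene: reads a file as UTF-8 explicitly.
class Utf8FileReader {

    /** Opens the file with an explicit UTF-8 decoder instead of the platform default. */
    static Reader open(String filename) throws IOException {
        return new BufferedReader(
                new InputStreamReader(new FileInputStream(filename), StandardCharsets.UTF_8));
    }

    /** Reads the whole file into a String, decoding it as UTF-8. */
    static String readAll(String filename) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = open(filename)) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Usage: java Utf8FileReader <file>
        System.out.println(readAll(args[0]));
    }
}
```

The `Reader` returned by `open` is what one would hand to the indexing code, so that the analyzer sees correctly decoded characters rather than bytes interpreted in MacRoman or another default encoding.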
Re: which HTML parser is better?
For all the parser suggestions, I think there is one important attribute to consider. Some parsers return data provided that the input HTML is sensible. Others are designed to be as flexible and tolerant as possible. If the input is clean and controlled, the former class is sufficient. Even a regular expression may be sufficient. (I think that's what the original poster wants.) If you are building a web crawler, you need something really tolerant. I once prototyped a nice, fast parser. Later I had to abandon it because it failed to parse about 15% of documents (it had problems handling nested quotes like onclick="alert('hi')").

No one has yet mentioned using ParserDelegator and ParserCallback, which are part of HTMLEditorKit in Swing. I have been successfully using these classes to parse out the text of an HTML file. You just need to extend HTMLEditorKit.ParserCallback and override the various methods that are called when different tags are encountered.

On Feb 1, 2005, at 3:14 AM, Jingkang Zhang wrote:

Three HTML parsers (Lucene web application demo, CyberNeko HTML Parser, JTidy) are mentioned in Lucene FAQ 1.3.27. Which is the best? Can it filter tags that are auto-created by the MS Word 'Save As HTML files' function?
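The ParserDelegator approach mentioned in this thread can be sketched roughly as follows. This uses only the JDK's Swing HTML classes; the class name `HtmlTextExtractor` and the choice to join text chunks with a single space are assumptions made for illustration, not something the original posters described.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical example class: collects the text content of an HTML document.
class HtmlTextExtractor extends HTMLEditorKit.ParserCallback {
    private final StringBuilder text = new StringBuilder();

    /** Called by the parser for each run of character data between tags. */
    @Override
    public void handleText(char[] data, int pos) {
        text.append(data).append(' ');
    }

    /** Parses the HTML from the reader and returns the collected text. */
    static String extract(Reader html) throws IOException {
        HtmlTextExtractor callback = new HtmlTextExtractor();
        // true = ignore any charset declared in the document; the Reader is already decoded
        new ParserDelegator().parse(html, callback, true);
        return callback.text.toString().trim();
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body><p onclick=\"alert('hi')\">Hello <b>world</b></p></body></html>";
        System.out.println(extract(new StringReader(html)));
    }
}
```

Other callbacks such as `handleStartTag` and `handleEndTag` can be overridden in the same way if tag-specific handling (for example, skipping certain elements) is needed.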
Hits and HitCollector performance
I am trying to do some filtering and rearrangement of search results. Two possibilities come to mind: iterating through the Hits or writing a custom HitCollector. All the documentation invariably warns about the performance impact of using a HitCollector with a large result set. The scenario of Google returning tens of millions of documents comes to mind. But I'm thinking, wouldn't Hits also have to fill an array with millions of integer ids at least? Or does it only return the correct length but build the result array on demand?

Another idea I have is to first go through the first n hits, let's say 1000, which I filter and rearrange. If the user ever needs results past the first 1000, then get them from Hits. Is there a recommended approach in these situations?
Re: Subversion conversion
Subversion rocks! I have just set up the Windows svn client TortoiseSVN with my favourite file manager, Total Commander 6.5. The svn status and commands are readily integrated with the file manager. Offline diff and revert are two things I really like about svn.

The conversion to Subversion is complete. The new repository is available to users read-only at:

http://svn.apache.org/repos/asf/lucene/java/trunk

Besides /trunk, there are also /branches and /tags. /tags contains all the CVS tags, so that you can grab a snapshot of a previous version. /trunk is analogous to CVS HEAD.

You can learn more about the Apache repository configuration, and how to use the command-line client to check out the repository, here:

http://www.apache.org/dev/version-control.html

Learn about Subversion, including the complete O'Reilly Subversion book in electronic form for free, here:

http://subversion.tigris.org

For committers, check out the repository using https and your Apache username/password.

The Lucene sandbox has been integrated into our single Subversion repository, under /java/trunk/sandbox:

http://svn.apache.org/repos/asf/lucene/java/trunk/sandbox/

The Lucene CVS repositories have been locked for read-only. If there are any issues with this conversion, let me know and I'll bring them to the Apache infrastructure group.

Erik
ANNOUNCE: MindRetrieve 0.4 - Search the web you have seen
I am pleased to announce that MindRetrieve 0.4.0 has been released. MindRetrieve is a desktop search tool that helps users search and organize the web they have seen. Download it from http://mindretrieve.berlios.de/.

Every day we read a large amount of information from the world wide web. The truth is that most of it does not register. We often search for similar topics time after time. It is time consuming. Often, after much work, we find that we are looking at the same old documents, like a déjà vu. MindRetrieve is here to help you find information buried deep in your memory.

MindRetrieve is a lightweight, cross-platform, open source application available under the BSD license. It has been tested on Windows and Linux with the latest versions of Firefox, Opera and IE. Mac support is planned.

Finally, I would like to thank the Lucene and PyLucene teams for making such wonderful software available. Thanks also for all the help you have provided in these discussion groups.
Re: Lucene in Action hits desk in UK
On Wed, 26 Jan 2005 11:42:52 +, John Haxby [EMAIL PROTECTED] wrote:

My copy of Lucene in Action has finally hit my desk in the UK. Hopefully the dispatch time quoted by amazon.co.uk will now start to drop to something more sensible. It's been interesting watching the price changes. When I ordered my copy back in November, I paid £19.38 for it. At around the time of publication, the price went up to £35.99, the list price. It's currently priced at £25.19, 30% off list price.

jch

I noticed the price swing at Amazon too. I found the best price at www.bookpool.com, at US$27.5. Great book, by the way. It is not just about using Lucene but touches on a wide range of background and related topics as well.
How to give recent documents a boost?
What is the best way to give recent documents a boost? I don't want to sort them in strict date order, just to give them some preference. If document 1, filed last week, has a score of 0.5 and document 2, filed last month, has a score of 0.55, then list document 1 first. But if document 1 has a score of only 0.05, then keep it at the end. Does anyone have experience with fine-tuning ranking by date?
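One way to picture the behaviour asked for here is a multiplicative boost that decays with document age. The sketch below is plain Java arithmetic applied to scores after the search, not a Lucene API; the formula, the weight of 0.5, and the 14-day half-life are all assumptions chosen so that the example in the question comes out as described.

```java
// Hypothetical scoring helper: boosts a relevance score by document recency.
class RecencyBoost {

    /**
     * Multiplier between 1 and 1 + weight. A brand-new document gets the full
     * extra weight, and the extra part halves every halfLifeDays.
     */
    static double boost(double ageDays, double weight, double halfLifeDays) {
        return 1.0 + weight * Math.pow(0.5, ageDays / halfLifeDays);
    }

    /** Applies the boost with assumed tuning values: weight 0.5, half-life 14 days. */
    static double adjusted(double score, double ageDays) {
        return score * boost(ageDays, 0.5, 14.0);
    }

    public static void main(String[] args) {
        System.out.println("last week,  score 0.50: " + adjusted(0.50, 7));
        System.out.println("last month, score 0.55: " + adjusted(0.55, 30));
        System.out.println("last week,  score 0.05: " + adjusted(0.05, 7));
    }
}
```

Because the boost multiplies the score rather than replacing it, a recent document with a very low score still ranks below an older document with a good score. The weight and half-life would need tuning against real queries.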
Re: Search Chinese in Unicode !!!
I would love to give it a try. Please email me at aurora00 at gmail.com. Thanks! Also, what is the opinion on the CJKAnalyzer and ChineseAnalyzer? Some people have actually said that the StandardAnalyzer works better. I wonder what the pros and cons are.

I've written a Chinese Analyzer for Lucene that uses a segmenter written by Erik Peterson. However, as the author of the segmenter does not want his code released under the Apache open source license (although his code _is_ open source), I cannot place my work in the Lucene Sandbox. This is unfortunate, because I believe the analyzer works quite well for indexing and searching Chinese docs in GB2312 and UTF-8 encoding, and I would like more people to test, use, and confirm this. So anyone who wants it can have it. Just shoot me an email. BTW, I have also written an Arabic analyzer, which is collecting dust for similar reasons.

Good luck, Ali Safarnejad

-Original Message-
From: Eric Chow [mailto:[EMAIL PROTECTED]
Sent: 21 January 2005 11:42
To: Lucene Users List
Subject: Re: Search Chinese in Unicode !!!

Search is not really correct with UTF-8 !!! The following is the search result I got using SearchFiles from the Lucene demo.

    d:\Downloads\Softwares\Apache\Lucene\lucene-1.4.3\src>java org.apache.lucene.demo.SearchFiles c:\temp\myindex
    Usage: java SearchFiles idnex
    Query: Searching for: g strange ??
    3 total matching documents
    0. ../docs/ChineseDemo.htmlthis files contains the -
    1. ../docs/luceneplan.html - Jakarta Lucene - Plan for enhancements to Lucene
    2. ../docs/api/index-all.html - Index (Lucene 1.4.3 API)
    Query:

Of the results above, only ChineseDemo.html includes the character that I want to search for!

The modified code in SearchFiles.java:

    BufferedReader in = new BufferedReader(new InputStreamReader(System.in, "UTF-8"));
Lucene and multiple languages
I'm trying to build a web search tool that works for multiple languages. I understand that Lucene ships with StandardAnalyzer plus German and Russian analyzers, with some more in the sandbox, and that indexing and searching should use the same analyzer.

Now let's say I have an index with documents in multiple languages, analyzed by an assortment of analyzers. When the user enters a query, which analyzer should be used? Should the user be asked for the language up front? What should I expect when the analyzer and the document don't match? Let's say the query is parsed using StandardAnalyzer. Would it match any documents indexed with the German analyzer at all? Or would it just give poor results?

Also, is there a good way to find out the languages used in a web page? There is a 'Content-Language' header in HTTP and a 'lang' attribute in HTML. It looks like people don't really use them. How can we recognize the language?

Even more interesting is multiple languages used in one document, let's say half English and half French. Is there a good way to deal with those cases?

Thanks for any guidance.
Re: how often to optimize?
Are unoptimized indices causing you any problems (e.g. slow searches, a high number of open file handles)? If not, then you don't even need to optimize until those issues become... issues.

OK, I have changed the process to not call optimize() at all. So far so good. The number of files hovers between 10 and 40 during the indexing of 10,000 files. It seems Lucene is doing some kind of self-maintenance to keep things in order.

Is it right to say that optimize() is a totally optional operation? I probably got the impression from the IndexHTML example that it is a natural step at the end of an incremental update. Since it replicates the whole index, it might be overkill for many applications to do daily.
Re: index size doubled?
Thanks for the heads up. I'm using Lucene 1.4.2. I tried to do optimize() again but it has no effect. Adding just a tiny dummy document would get rid of it. I'm optimizing every few hundred documents because I am trying to simulate incremental updates. This leads to another question, which I will post separately. Thanks.

Another possibility is that you are using an older version of Lucene, which was known to have a bug with similar symptoms. Get the latest version of Lucene. You shouldn't really have multiple .cfs files after optimizing your index. Also, optimize only at the end, if you care about indexing speed.

Otis

--- Paul Elschot [EMAIL PROTECTED] wrote:

On Tuesday 21 December 2004 05:49, aurora wrote:

I'm testing the rebuilding of the index. I add several hundred documents, optimize, add another few hundred, and so on. Right now I have around 7000 files. I observed that once the index gets to a certain size, every time after optimize there are two files of roughly the same size, like below:

    12/20/2004  01:57p          13 deletable
    12/20/2004  01:57p          29 segments
    12/20/2004  01:53p  14,460,367 _5qf.cfs
    12/20/2004  01:57p  15,069,013 _5zr.cfs

The total index size is double what I expect. This is not always reproducible. (I'm constantly tuning my program and the set of documents.) Sometimes I get a decent single file after optimize. What is happening?

Lucene tried to delete the older version (_5qf.cfs above), but got an error back from the file system. After that it put the name of that segment in the deletable file, so that it can try to delete that segment later. This is known behaviour on FAT file systems. These randomly take some time for themselves to finish closing a file after it has been correctly closed by a program.

Regards, Paul Elschot
how often to optimize?
Right now I am incrementally adding about 100 documents to the index a day and then optimizing after that. I find that optimize essentially rebuilds the entire index into a single file. So the amount of disk writing is proportional to the total index size, not to the size of the documents incrementally added.

So my question is: would it be overkill to optimize every day? Is there any guideline on how often to optimize? Every 1000 documents or more? Every week? Is there any concern if a lot of documents are added without optimizing?

Thanks.
index size doubled?
I'm testing the rebuilding of the index. I add several hundred documents, optimize, add another few hundred, and so on. Right now I have around 7000 files. I observed that once the index gets to a certain size, every time after optimize there are two files of roughly the same size, like below:

    12/20/2004  01:57p          13 deletable
    12/20/2004  01:57p          29 segments
    12/20/2004  01:53p  14,460,367 _5qf.cfs
    12/20/2004  01:57p  15,069,013 _5zr.cfs

The total index size is double what I expect. This is not always reproducible. (I'm constantly tuning my program and the set of documents.) Sometimes I get a decent single file after optimize. What is happening?
auto-generate uid?
Is there a way to auto-generate a uid in Lucene? Even just a way to query the highest uid, so that the application can add one to it, would do. Thanks.
Re: auto-generate uid?
Just to clarify: I have a Field 'uid' whose value is a unique integer. I use it as a key to the document stored externally. I don't mean Lucene's internal document number. I was wondering if there is a method to query the highest value of a field, perhaps something like:

    IndexReader.maxTerm('uid')

What would the purpose of an auto-generated UID be? But no, Lucene does not generate UIDs for you. Documents are numbered internally by their insertion order. This number changes, however, when documents are deleted in the middle and the index is optimized.

Erik

On Nov 22, 2004, at 1:50 PM, aurora wrote:

Is there a way to auto-generate a uid in Lucene? Even just a way to query the highest uid, so that the application can add one to it, would do. Thanks.
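One wrinkle with finding the highest value of a 'uid' field: Lucene orders the terms of a field as strings, so an unpadded "9" sorts after "100". A possible workaround, sketched below, is to zero-pad the uid before indexing so that string order matches numeric order. The `TreeSet` here is only a stand-in for the sorted terms of the field, and `UidAllocator`, `encode`, and `nextUid` are hypothetical names; `IndexReader.maxTerm` from the message above does not exist.

```java
import java.util.Locale;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical helper: allocates the next uid from a sorted set of padded uid terms.
class UidAllocator {
    static final int WIDTH = 10; // enough digits for any non-negative int

    /** Zero-pads a non-negative uid so that string order equals numeric order. */
    static String encode(int uid) {
        return String.format(Locale.ROOT, "%0" + WIDTH + "d", uid);
    }

    /** Returns the highest existing uid plus one, or 1 if there are none yet. */
    static int nextUid(SortedSet<String> uidTerms) {
        return uidTerms.isEmpty() ? 1 : Integer.parseInt(uidTerms.last()) + 1;
    }

    public static void main(String[] args) {
        // Stand-in for the sorted terms of the 'uid' field in an index.
        SortedSet<String> terms = new TreeSet<>();
        terms.add(encode(9));
        terms.add(encode(10));
        terms.add(encode(100));
        System.out.println("next uid: " + nextUid(terms));
    }
}
```

Against a real index the application would still have to walk the term enumeration for the field to reach the last term, or simply keep the counter outside the index.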
using lucene as a dictionary database?
Besides full-text indexing, I need a database that represents a large dictionary like:

    (key1, key2) -> docid

I am choosing between building a home-grown solution and using Berkeley DB. Then I thought: I am using Lucene anyway, so wouldn't it make sense to use it as my database too? Just make key1 and key2 two keyword fields, plus an UnIndexed field for docid?

I need to do something like:

    get(key1, key2) -> docid
    get(key1) -> list of docid

These need to be fast.

    add(list of (key1, key2, docid))

This would be done perhaps once a day in a batch.

My experience with Lucene is that it is very efficient in terms of speed and storage size. Would this be an appropriate use of Lucene?
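To make the required operations concrete, here is a small in-memory reference model of the dictionary described in this message. It is deliberately not a Lucene implementation: the class `PairDictionary` and its nested-map layout are assumptions for illustration, useful mainly as a baseline to compare any Lucene or Berkeley DB version against.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model of the (key1, key2) -> docid dictionary.
class PairDictionary {
    // key1 -> (key2 -> docid); insertion order is kept so results are predictable
    private final Map<String, Map<String, String>> index = new LinkedHashMap<>();

    /** Batch add: each entry is a {key1, key2, docid} triple. */
    void addAll(List<String[]> triples) {
        for (String[] t : triples) {
            index.computeIfAbsent(t[0], k -> new LinkedHashMap<>()).put(t[1], t[2]);
        }
    }

    /** get(key1, key2) -> docid, or null if the pair is unknown. */
    String get(String key1, String key2) {
        Map<String, String> byKey2 = index.get(key1);
        return byKey2 == null ? null : byKey2.get(key2);
    }

    /** get(key1) -> list of docids, empty if key1 is unknown. */
    List<String> get(String key1) {
        Map<String, String> byKey2 = index.get(key1);
        return byKey2 == null ? new ArrayList<String>() : new ArrayList<String>(byKey2.values());
    }

    public static void main(String[] args) {
        PairDictionary dict = new PairDictionary();
        dict.addAll(Arrays.asList(
                new String[] {"a", "x", "1"},
                new String[] {"a", "y", "2"},
                new String[] {"b", "x", "3"}));
        System.out.println(dict.get("a", "y"));
        System.out.println(dict.get("a"));
    }
}
```

In the Lucene variant proposed in the message, each triple would become one document, with the two lookups expressed as queries on the key1 and key2 keyword fields.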