bookkeeping documents cause problem in Sort

2005-02-16 Thread aurora
I understand that, unlike a relational database, Lucene allows documents to have different sets of fields. My index has documents with a date field and a content field. There are also a few bookkeeping documents that do not have the date field. Things work well except in one case: Sort

Re: Lucene Unicode Usage

2005-02-09 Thread aurora
So you have a UTF-8 encoded text file. But how do you read the file into Java? Java's default encoding is likely to be something other than UTF-8. Make sure you specify the encoding, like: new InputStreamReader(new FileInputStream(filename), "UTF-8"); On Wed, 9 Feb 2005 22:32:38 -0700, Owen
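
The advice in this post, naming the encoding explicitly instead of relying on the platform default, could look roughly like the sketch below. The class name Utf8FileReader and the method readUtf8 are made up for illustration and are not part of Lucene. The sketch also uses StandardCharsets, which is available only in newer Java versions than the one current when this post was written.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not a Lucene class): reads a whole file as UTF-8 text.
public class Utf8FileReader {

    public static String readUtf8(String filename) throws IOException {
        StringBuilder text = new StringBuilder();
        // Passing the charset explicitly keeps the result independent of
        // the JVM's default encoding.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(filename), StandardCharsets.UTF_8))) {
            char[] buffer = new char[4096];
            int count;
            while ((count = reader.read(buffer)) != -1) {
                text.append(buffer, 0, count);
            }
        }
        return text.toString();
    }
}
```

The returned string (or a Reader built the same way) can then be handed to whatever indexing code is in use.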

Re: which HTML parser is better?

2005-02-03 Thread aurora
For all the parser suggestions, I think there is one important attribute to consider. Some parsers return data provided that the input HTML is sensible. Other parsers are designed to be as flexible and tolerant as possible. If the input is clean and controlled, the former class is sufficient. Even some regular

Hits and HitCollector performance

2005-02-03 Thread aurora
I am trying to do some filtering and rearrangement of search results. Two possibilities that come to mind are iterating through the Hits or writing a custom HitCollector. All the documentation invariably warns about the performance impact of using HitCollector with a large result set. The scenario that google

Re: Subversion conversion

2005-02-02 Thread aurora
Subversion rocks! I have just set up the Windows svn client TortoiseSVN with my favourite file manager, Total Commander 6.5. The svn status and commands are readily integrated with the file manager. Offline diff and revert are two things I really like about svn. The conversion to Subversion

ANNOUNCE: MindRetrieve 0.4 - Search the web you have seen

2005-01-31 Thread aurora
I am pleased to announce that MindRetrieve 0.4.0 has been released. MindRetrieve is a desktop search tool that helps users search and organize the web they have seen. Download it from http://mindretrieve.berlios.de/. Every day we read a large amount of information from the world wide web. The

Re: Lucene in Action hits desk in UK

2005-01-26 Thread aurora
On Wed, 26 Jan 2005 11:42:52 +, John Haxby [EMAIL PROTECTED] wrote: My copy of Lucene in Action has finally hit my desk in the UK. Hopefully the dispatch time quoted by amazon.co.uk will now start to drop to something more sensible. It's been interesting watching the price changes. When

How to give recent documents a boost?

2005-01-25 Thread aurora
What is the best way to give recent documents a boost? Not sorting them in strict date order, but giving them some preference. If document 1 filed last week has a score of 0.5 and document 2 filed last month has a score of 0.55, then list document 1 first. But if document 1 has a score of
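
One way to picture the kind of preference described in this question is to multiply each relevance score by a freshness factor that decays with age. This is only a sketch under stated assumptions, not Lucene's scoring and not an answer given in the thread. The class RecencyBoost, the method boostedScore, the half-life decay formula, and the parameter values are all hypothetical.

```java
// Hypothetical illustration (not a Lucene API): blend a relevance score
// with a freshness factor that halves every halfLifeDays.
public class RecencyBoost {

    // score:        the original relevance score
    // ageDays:      how long ago the document was filed
    // halfLifeDays: assumed age at which the freshness bonus halves
    // weight:       assumed maximum relative bonus for a brand-new document
    public static double boostedScore(double score, double ageDays,
                                      double halfLifeDays, double weight) {
        double freshness = Math.pow(0.5, ageDays / halfLifeDays);
        return score * (1.0 + weight * freshness);
    }

    public static void main(String[] args) {
        // Assumed settings: 14-day half-life, up to 50% bonus.
        double lastWeek = boostedScore(0.50, 7, 14, 0.5);
        double lastMonth = boostedScore(0.55, 30, 14, 0.5);
        System.out.println("document 1 ranks first: " + (lastWeek > lastMonth));
    }
}
```

With these assumed settings the week-old document with score 0.5 ranks above the month-old document with score 0.55, while a week-old document with a much lower score (say 0.3) still ranks below it. The half-life and weight would need tuning for a real index.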

Re: Search Chinese in Unicode !!!

2005-01-21 Thread aurora
I would love to give it a try. Please email me at aurora00 at gmail.com. Thanks! Also, what is the opinion on the CJKAnalyzer and ChineseAnalyzer? Some people actually said the StandardAnalyzer works better. I wonder what the pros and cons are. I've written a Chinese Analyzer for Lucene that

Lucene and multiple languages

2005-01-20 Thread aurora
I'm trying to build a web search tool that could work for multiple languages. I understand that Lucene ships with StandardAnalyzer plus German and Russian analyzers, with some more in the sandbox, and that indexing and searching should use the same analyzer. Now let's say I have an

Re: how often to optimize?

2004-12-28 Thread aurora
Are not optimized indices causing you any problems (e.g. slow searches, high number of open file handles)? If no, then you don't even need to optimize until those issues become... issues. OK, I have changed the process to not call optimize() at all. So far so good. The number of files hovers

Re: index size doubled?

2004-12-21 Thread aurora
care about indexing speed. Otis --- Paul Elschot [EMAIL PROTECTED] wrote: On Tuesday 21 December 2004 05:49, aurora wrote: I'm testing the rebuilding of the index. I add several hundred documents, optimize and add another few hundred and so on. Right now I have around 7000 files. I observed after

how often to optimize?

2004-12-21 Thread aurora
Right now I am incrementally adding about 100 documents a day to the index and then optimizing after that. I find that optimize essentially rebuilds the entire index into a single file. So the amount of disk writing is proportional to the total index size, not to the size of documents

index size doubled?

2004-12-20 Thread aurora
I'm testing the rebuilding of the index. I add several hundred documents, optimize, add another few hundred, and so on. Right now I have around 7000 files. I observed that after the index gets to a certain size, every time after optimize there are two files of roughly the same size, like below:

auto-generate uid?

2004-11-22 Thread aurora
Is there a way to auto-generate a uid in Lucene? Even if it is just a way to query the highest uid and let the application add one to it, that will do. Thanks. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail:

Re: auto-generate uid?

2004-11-22 Thread aurora
, 2004, at 1:50 PM, aurora wrote: Is there a way to auto-generate a uid in Lucene? Even if it is just a way to query the highest uid and let the application add one to it, that will do. Thanks.

using lucene as a dictionary database?

2004-11-03 Thread aurora
Besides full text indexing, I need a database that represents a large dictionary like: (key1, key2) -> docid. I am deciding between building a home-grown solution and using Berkeley DB. Then I thought: I am using Lucene anyway, so wouldn't it make sense to use it as my database too? Just make key1 and