bookkeeping documents cause problems in Sort

2005-02-16 Thread aurora
I understand that unlike a relational database, Lucene is flexible about
documents having different sets of fields. My index has documents with a
date field and a content field. There are also a few bookkeeping documents
that do not have the date field. Things work well except in one case:

  Sort sort = new Sort("date");
  searcher.search(query, sort);
In this case an exception is thrown:
  java.lang.RuntimeException: field date does not appear to be indexed
It does not make sense to sort by 'date' when a document does not have a
'date' field. On the other hand, I don't expect search() to return any
bookkeeping documents at all, since the query looks for fields that are not
in those documents. Is this an implementation issue, or is there an
inherent reason every document needs the 'date' field when sorting on it?
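
A workaround I am considering (an assumption on my part, not something
confirmed here) is to give the bookkeeping documents a sentinel value in
the sort field at index time, so the field exists for every document. A
minimal sketch against the 1.4-era API; the 'type' field and the sentinel
value are purely illustrative:

  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;

  public class BookkeepingDoc {
      // Build a bookkeeping document that still carries the "date" field,
      // using a sentinel value that sorts before all real dates.
      public static Document create(String payload) {
          Document doc = new Document();
          doc.add(Field.Keyword("type", "bookkeeping")); // marks internal docs
          doc.add(Field.Keyword("date", "00000000"));    // sentinel, sorts first
          doc.add(Field.UnIndexed("payload", payload));
          return doc;
      }
  }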



Re: Lucene Unicode Usage

2005-02-09 Thread aurora
So you have a UTF-8 encoded text file. But how do you read the file into
Java? Java's default encoding is likely to be something other than UTF-8.
Make sure you specify the encoding, like:

  new InputStreamReader(new FileInputStream(filename), "UTF-8");
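
For example, a minimal sketch that reads a whole UTF-8 file this way (the
file name is taken from the command line):

  import java.io.*;

  public class Utf8Dump {
      public static void main(String[] args) throws IOException {
          // Read UTF-8 explicitly, regardless of the platform default.
          BufferedReader in = new BufferedReader(
                  new InputStreamReader(new FileInputStream(args[0]), "UTF-8"));
          for (String line; (line = in.readLine()) != null; ) {
              System.out.println(line);
          }
          in.close();
      }
  }
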
On Wed, 9 Feb 2005 22:32:38 -0700, Owen Densmore [EMAIL PROTECTED]  
wrote:

I'm building an index from a FileMaker database by dumping the data to a  
tab-separated file.  Because the FileMaker output is encoded in  
MacRoman, and uses Mac line separators, I run a script across the tab  
file to clean it up:
	tr '\r\v' '\n ' | iconv -f MAC -t UTF-8
This basically converts the Mac \r's to \n's, replaces FileMaker's vtabs  
(for inter-field CRs) with blanks, and runs a character converter to  
build utf-8 data for Java to use.  Looks fine in jEdit and BBEdit, both  
of which understand UTF.

BUT -- when I look at the indexes created in Lucene using Luke, I get  
unprintable letters!  Writing programs to dump the terms (using Writer  
subclasses which handle unicode correctly) shows that indeed the files  
now have odd characters when viewed w/ jEdit and BBEdit.

The analyzer used to build the index looks like:
  import java.io.Reader;
  import org.apache.lucene.analysis.*;
  import org.apache.lucene.analysis.standard.StandardFilter;
  import org.apache.lucene.analysis.standard.StandardTokenizer;

  public class RedfishAnalyser extends Analyzer {
      String[] stopwords;

      public RedfishAnalyser(String[] stopwords) {
          this.stopwords = stopwords;
      }

      public RedfishAnalyser() {
          this.stopwords = StopAnalyzer.ENGLISH_STOP_WORDS;
      }

      public TokenStream tokenStream(String fieldName, Reader reader) {
          return new PorterStemFilter(
                  new StopFilter(
                          new LowerCaseFilter(
                                  new StandardFilter(
                                          new StandardTokenizer(reader))),
                          stopwords));
      }
  }
Yikes, what am I doing wrong?!  Is the analyzer at fault?  It's about the
only place where I can see a problem happening.

Thanks for any pointers,
Owen



Re: which HTML parser is better?

2005-02-03 Thread aurora
For all the parser suggestions, I think there is one important attribute.
Some parsers return good data provided the input HTML is sensible. Other
parsers are designed to be as flexible and tolerant as they can be. If the
input is clean and controlled, the former class is sufficient; even a
regular expression may be enough. (I think that's what the original poster
wants.) If you are building a web crawler, you need something really
tolerant.

Once I prototyped a nice and fast parser. Later I had to abandon it because
it failed to parse about 15% of documents (it had problems handling nested
quotes like onclick="alert('hi')").

No one has yet mentioned using ParserDelegator and ParserCallback that  
are part of HTMLEditorKit in Swing.  I have been successfully using  
these classes to parse out the text of an HTML file.  You just need to  
extend HTMLEditorKit.ParserCallback and override the various methods  
that are called when different tags are encountered.
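
A minimal sketch of that approach (the class name here is illustrative, not
from a real project):

  import java.io.FileReader;
  import java.io.IOException;
  import javax.swing.text.html.HTMLEditorKit;
  import javax.swing.text.html.parser.ParserDelegator;

  public class TextExtractor extends HTMLEditorKit.ParserCallback {
      private final StringBuffer text = new StringBuffer();

      // Called for each run of character data between tags.
      public void handleText(char[] data, int pos) {
          text.append(data).append(' ');
      }

      public static void main(String[] args) throws IOException {
          TextExtractor callback = new TextExtractor();
          // The boolean tells the parser to ignore charset declarations
          // inside the HTML itself.
          new ParserDelegator().parse(new FileReader(args[0]), callback, true);
          System.out.println(callback.text);
      }
  }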

On Feb 1, 2005, at 3:14 AM, Jingkang Zhang wrote:
Three HTML parsers (Lucene web application demo, CyberNeko HTML Parser,
JTidy) are mentioned in the Lucene FAQ 1.3.27. Which is the best? Can it
filter out the tags that MS Word's 'Save As HTML' function auto-creates?



Hits and HitCollector performance

2005-02-03 Thread aurora
I am trying to do some filtering and rearrangement of search results. Two
possibilities that come to mind are iterating through the Hits or writing a
custom HitCollector.

All the documentation invariably warns about the performance impact of
using a HitCollector with a large result set. The scenario of Google
returning tens of millions of documents comes to mind. But I'm thinking,
wouldn't Hits also have to fill up an array with millions of integer ids at
least? Or does it only report the correct length and build the result array
on demand?

Another idea I have is to first go through the first n hits, say 1000,
which I filter and rearrange. If the user ever needs results past 1000,
then get them from Hits.

Is there any recommended way in these situations?
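
For reference, the collector variant I have in mind is only a few lines; a
sketch (the score cutoff is an arbitrary placeholder):

  import java.util.BitSet;
  import org.apache.lucene.search.HitCollector;

  // Collects matching ids into a BitSet; no Hits array is materialized.
  public class ThresholdCollector extends HitCollector {
      private final BitSet matches = new BitSet();

      public void collect(int doc, float score) {
          if (score > 0.1f) {          // arbitrary cutoff, for illustration
              matches.set(doc);
          }
      }

      public BitSet getMatches() {
          return matches;
      }
  }

It would be passed to searcher.search(query, collector) instead of asking
for a Hits object.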


Re: Subversion conversion

2005-02-02 Thread aurora
Subversion rocks!
I have just set up the Windows svn client TortoiseSVN with my favourite
file manager, Total Commander 6.5. The svn status and commands are readily
integrated with the file manager. Offline diff and revert are two things I
really like about svn.


The conversion to Subversion is complete.  The new repository is  
available to users read-only at:

http://svn.apache.org/repos/asf/lucene/java/trunk
Besides /trunk, there are also /branches and /tags.  /tags contains all the
CVS tags made, so that you can grab a snapshot of a previous version.
/trunk is analogous to CVS HEAD.  You can learn more here about the Apache
repository configuration and how to use the command-line client to check
out the repository:

http://www.apache.org/dev/version-control.html
Learn about Subversion, including the complete O'Reilly Subversion book,
available free in electronic form, here:

http://subversion.tigris.org
For committers, check out the repository using https and your Apache  
username/password.

The Lucene sandbox has been integrated into our single Subversion  
repository, under /java/trunk/sandbox:

http://svn.apache.org/repos/asf/lucene/java/trunk/sandbox/
The Lucene CVS repositories have been locked and are now read-only.
If there are any issues with this conversion, let me know and I'll bring  
them to the Apache infrastructure group.

	Erik



ANNOUNCE: MindRetrieve 0.4 - Search the web you have seen

2005-01-31 Thread aurora
I am pleased to announce that MindRetrieve 0.4.0 has been released.
MindRetrieve is a desktop search tool that helps users search and organize
the web they have seen. Download it from http://mindretrieve.berlios.de/.

Every day we read a large amount of information on the World Wide Web. The
truth is, most of it does not register. We often search for similar topics
time after time, which is time consuming. Often, after much work, we find
we are looking at the same old documents, like déjà vu. MindRetrieve is
here to help you find information buried deep in your memory.

MindRetrieve is a lightweight, cross-platform, open source application  
available under the BSD license. It has been tested on Windows and Linux  
with the latest versions of Firefox, Opera and IE. Mac support is planned.

Finally, I would like to thank the Lucene and PyLucene teams for making
such wonderful software available, and thanks for all the help you have
provided in these discussion groups.



Re: Lucene in Action hits desk in UK

2005-01-26 Thread aurora
On Wed, 26 Jan 2005 11:42:52 +0000, John Haxby [EMAIL PROTECTED] wrote:
My copy of Lucene in Action has finally hit my desk in the UK.   
Hopefully the dispatch time quoted by amazon.co.uk will now start to  
drop to something more sensible.

It's been interesting watching the price changes.  When I ordered my  
copy back in November, I paid £19.38 for it.  At around the time of  
publication, the price went up to £35.99, the list price.   It's  
currently priced at £25.19, 30% off list price.

jch
I noticed the price swings at Amazon too. I found the best price at
www.bookpool.com, at US$27.50. Great book, by the way. It is not just about
using Lucene but touches on a wide range of background and related topics
as well.



How to give recent documents a boost?

2005-01-25 Thread aurora
What is the best way to give recent documents a boost? Not sorting them in
strict date order, but giving them some preference. If document 1, filed
last week, has a score of 0.5 and document 2, filed last month, has a score
of 0.55, then list document 1 first. But if document 1 has a score of only
0.05, then keep it at the end. Any experience with fine-tuning by date?
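
One approach I am considering (an assumption, not a tested recipe) is an
index-time boost that decays with age, so recency nudges the score without
overriding relevance. A sketch; the constants are illustrative knobs only:

  import org.apache.lucene.document.Document;

  public class RecencyBoost {
      // ageInDays is assumed to be computed by the caller from the
      // document's date field before the document is added to the index.
      public static void apply(Document doc, long ageInDays) {
          // Decays from 1.5 toward 1.0 as the document gets older.
          float boost = 1.0f + 0.5f / (1.0f + ageInDays / 30.0f);
          doc.setBoost(boost);
      }
  }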



Re: Search Chinese in Unicode !!!

2005-01-21 Thread aurora
I would love to give it a try. Please email me at aurora00 at gmail.com.
Thanks!

Also, what is the opinion on CJKAnalyzer and ChineseAnalyzer? Some people
actually say StandardAnalyzer works better. I wonder what the pros and cons
are.
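
For anyone comparing them, a quick way to see the difference is to print
the tokens each analyzer emits; a sketch (CJKAnalyzer lives in the Lucene
sandbox, and the field name is arbitrary):

  import java.io.StringReader;
  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.Token;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.cjk.CJKAnalyzer;

  public class TokenDump {
      // Print every token the analyzer produces for the given text.
      public static void dump(Analyzer analyzer, String text) throws Exception {
          TokenStream stream =
                  analyzer.tokenStream("content", new StringReader(text));
          for (Token t = stream.next(); t != null; t = stream.next()) {
              System.out.println(t.termText());
          }
      }

      public static void main(String[] args) throws Exception {
          dump(new CJKAnalyzer(), args[0]); // swap in other analyzers to compare
      }
  }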


I've written a Chinese Analyzer for Lucene that uses a segmenter written by
Erik Peterson. However, as the author of the segmenter does not want his
code released under the Apache open source license (although his code _is_
open source), I cannot place my work in the Lucene sandbox. This is
unfortunate, because I believe the analyzer works quite well in indexing
and searching Chinese docs in GB2312 and UTF-8 encoding, and I would like
more people to test, use, and confirm this. So anyone who wants it can have
it. Just shoot me an email.

BTW, I have also written an Arabic analyzer, which is collecting dust for
similar reasons.
Good luck,

Ali Safarnejad
-----Original Message-----
From: Eric Chow [mailto:[EMAIL PROTECTED]
Sent: 21 January 2005 11:42
To: Lucene Users List
Subject: Re: Search Chinese in Unicode !!!
Search not really correct with UTF-8 !!!
The following is the search result I got using SearchFiles from the Lucene
demo.

  d:\Downloads\Softwares\Apache\Lucene\lucene-1.4.3\src>java org.apache.lucene.demo.SearchFiles c:\temp\myindex
  Usage: java SearchFiles index
  Query:
  Searching for: g
  strange ??
  3 total matching documents
  0. ../docs/ChineseDemo.html
     - this file contains the character
  1. ../docs/luceneplan.html
     - Jakarta Lucene - Plan for enhancements to Lucene
  2. ../docs/api/index-all.html
     - Index (Lucene 1.4.3 API)
  Query:


From the above result, only ChineseDemo.html includes the character that I
want to search for!


The modified code in SearchFiles.java:

  BufferedReader in = new BufferedReader(
          new InputStreamReader(System.in, "UTF-8"));



Lucene and multiple languages

2005-01-20 Thread aurora
I'm trying to build a web search tool that works for multiple languages. I
understand that Lucene ships with StandardAnalyzer plus German and Russian
analyzers, with some more in the sandbox, and that indexing and searching
should use the same analyzer.

Now let's say I have an index with documents in multiple languages,
analyzed by an assortment of analyzers. When the user enters a query, which
analyzer should be used? Should the user be asked for the language up
front? What should I expect when the analyzer and the document don't match?
Say the query is parsed using StandardAnalyzer: would it match any
documents indexed with the German analyzer at all, or would it just produce
poor results?

Also, is there a good way to find out the language used in a web page?
There is a 'Content-Language' header in HTTP and a 'lang' attribute in
HTML, but it looks like people don't really use them. How can we recognize
the language?

Even more interesting is multiple languages used in one document, say half
English and half French. Is there a good way to deal with those cases?
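
If it helps frame the question, the dispatch I have in mind is a simple
language-to-analyzer map; a sketch that assumes the language tag is somehow
known per document or per query:

  import java.util.HashMap;
  import java.util.Map;
  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.de.GermanAnalyzer;
  import org.apache.lucene.analysis.ru.RussianAnalyzer;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;

  public class AnalyzerRegistry {
      private static final Map ANALYZERS = new HashMap();
      static {
          ANALYZERS.put("de", new GermanAnalyzer());
          ANALYZERS.put("ru", new RussianAnalyzer());
      }

      // Fall back to StandardAnalyzer when the language is unknown.
      public static Analyzer forLanguage(String lang) {
          Analyzer a = (Analyzer) ANALYZERS.get(lang);
          return a != null ? a : new StandardAnalyzer();
      }
  }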

Thanks for any guidance.


Re: how often to optimize?

2004-12-28 Thread aurora
Are unoptimized indices causing you any problems (e.g. slow searches, a
high number of open file handles)?  If not, then you don't even need to
optimize until those issues become... issues.
OK, I have changed the process to not call optimize() at all. So far so
good. The number of files hovers between 10 and 40 during the indexing of
10,000 files. It seems Lucene does some kind of self-maintenance to keep
things in order.

Is it right to say optimize() is a totally optional operation? I probably
got the impression from the IndexHTML example that it is the natural step
to end an incremental update. Since it rewrites the whole index, it may be
overkill for many applications to do daily.
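
For what it's worth, the self-maintenance appears to be segment merging,
which the 1.4-era IndexWriter exposes as public tuning fields; a sketch
with illustrative values:

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexWriter;

  public class TunedWriter {
      public static IndexWriter open(String path) throws Exception {
          IndexWriter writer =
                  new IndexWriter(path, new StandardAnalyzer(), false);
          writer.mergeFactor = 10;      // merge once 10 segments accumulate
          writer.maxMergeDocs = 100000; // cap the size of merged segments
          return writer;
      }
  }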




Re: index size doubled?

2004-12-21 Thread aurora
Thanks for the heads up. I'm using Lucene 1.4.2.
I tried to run optimize() again but it had no effect. Adding just a tiny
dummy document would get rid of it.

I'm doing an optimize every few hundred documents because I'm trying to
simulate incremental updates. This leads to another question I will post
separately.

Thanks.

Another possibility is that you are using an older version of Lucene,
which was known to have a bug with similar symptoms.  Get the latest
version of Lucene.
You shouldn't really have multiple .cfs files after optimizing your
index.  Also, optimize only at the end, if you care about indexing
speed.
Otis
--- Paul Elschot [EMAIL PROTECTED] wrote:
On Tuesday 21 December 2004 05:49, aurora wrote:
 I'm testing the rebuilding of the index. I add several hundred documents,
 optimize, and add another few hundred, and so on. Right now I have around
 7000 files. I observed that after the index gets to a certain size, every
 time after optimize there are two files of roughly the same size, like
 below:

  12/20/2004  01:57p  13 deletable
  12/20/2004  01:57p  29 segments
  12/20/2004  01:53p  14,460,367 _5qf.cfs
  12/20/2004  01:57p  15,069,013 _5zr.cfs

 The total index size is double what I expect. This is not always
 reproducible. (I'm constantly tuning my program and the set of documents.)
 Sometimes I get a decent single file after optimize. What was happening?
Lucene tried to delete the older version (_5qf.cfs above), but got an error
back from the file system. After that it put the name of that segment in
the deletable file, so it can try later to delete that segment.
This is known behaviour on FAT file systems. These randomly take some time
for themselves to finish closing a file after it has been correctly closed
by a program.
Regards,
Paul Elschot




how often to optimize?

2004-12-21 Thread aurora
Right now I am incrementally adding about 100 documents to the index a day
and then optimizing. I find that optimize essentially rebuilds the entire
index into a single file, so the amount written to disk is proportional to
the total index size, not to the size of the documents incrementally added.

So my question is, would it be overkill to optimize every day? Is there any
guideline on how often to optimize? Every 1000 documents? Every week? Is
there any concern if a lot of documents are added without optimizing?

Thanks.


index size doubled?

2004-12-20 Thread aurora
I'm testing the rebuilding of the index. I add several hundred documents,
optimize, and add another few hundred, and so on. Right now I have around
7000 files. I observed that after the index gets to a certain size, every
time after optimize there are two files of roughly the same size, like
below:

12/20/2004  01:57p  13 deletable
12/20/2004  01:57p  29 segments
12/20/2004  01:53p  14,460,367 _5qf.cfs
12/20/2004  01:57p  15,069,013 _5zr.cfs
The total index size is double what I expect. This is not always
reproducible. (I'm constantly tuning my program and the set of documents.)
Sometimes I get a decent single file after optimize. What is happening?



auto-generate uid?

2004-11-22 Thread aurora
Is there a way to auto-generate a uid in Lucene? Even just a way to query
the highest uid and let the application add one to it would do.

Thanks.


Re: auto-generate uid?

2004-11-22 Thread aurora
Just to clarify: I have a field 'uid' whose value is a unique integer. I
use it as a key to the document stored externally. I don't mean Lucene's
internal document number.

I was wondering if there is a method to query the highest value of a field,
perhaps something like:

  IndexReader.maxTerm("uid")
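
In the absence of such a method, the closest thing I can see is walking the
field's terms with a TermEnum; a rough sketch (untested), assuming the uids
are zero-padded so lexicographic order matches numeric order:

  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.index.TermEnum;

  public class MaxUid {
      // Terms come back sorted, so the last "uid" term seen is the maximum.
      public static String maxUid(IndexReader reader) throws Exception {
          TermEnum terms = reader.terms(new Term("uid", ""));
          String max = null;
          try {
              do {
                  Term t = terms.term();
                  if (t == null || !"uid".equals(t.field())) break;
                  max = t.text();
              } while (terms.next());
          } finally {
              terms.close();
          }
          return max;
      }
  }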

What would the purpose of an auto-generated UID be?
But no, Lucene does not generate UID's for you.  Documents are numbered  
internally by their insertion order.  This number changes, however, when  
documents are deleted in the middle and the index is optimized.

Erik
On Nov 22, 2004, at 1:50 PM, aurora wrote:
Is there a way to auto-generate uid in Lucene? Even it is just a way to  
query the highest uid and let the application add one to it will do.

Thanks.



using lucene as a dictionary database?

2004-11-03 Thread aurora
Besides full-text indexing, I need a database that represents a large
dictionary, like:

  (key1, key2) -> docid

I am deciding between building a home-grown solution and using Berkeley DB.
Then again, I am using Lucene anyway; wouldn't it make sense to use it as
my database too? Just make key1 and key2 two Keyword fields and add an
UnIndexed field for the docid?

I need to do something like

  get(key1, key2) -> docid
  get(key1) -> list of docids

and these need to be fast. Also

  add( list of (key1, key2, docid) )

which would be done perhaps once a day in a batch.

My experience with Lucene is that it is very efficient in terms of speed
and storage size. Would this be a reasonable use of Lucene?
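
For concreteness, the scheme I have in mind would look roughly like this
sketch in the 1.4-era API (field names as above, error handling omitted):

  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.BooleanQuery;
  import org.apache.lucene.search.Hits;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.TermQuery;

  public class DictionaryIndex {
      // One "row": two keyword keys plus a stored-only docid payload.
      public static Document entry(String key1, String key2, String docid) {
          Document doc = new Document();
          doc.add(Field.Keyword("key1", key1));
          doc.add(Field.Keyword("key2", key2));
          doc.add(Field.UnIndexed("docid", docid));
          return doc;
      }

      // get(key1, key2) -> docid
      public static String get(IndexSearcher searcher, String key1, String key2)
              throws Exception {
          BooleanQuery query = new BooleanQuery();
          query.add(new TermQuery(new Term("key1", key1)), true, false); // required
          query.add(new TermQuery(new Term("key2", key2)), true, false); // required
          Hits hits = searcher.search(query);
          return hits.length() > 0 ? hits.doc(0).get("docid") : null;
      }
  }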
