RE: demo IndexHTML parser breaks unicode?

2004-09-24 Thread wallen
In org.apache.lucene.demo.HTMLDocument you need to change the input stream to use a different encoding. Replace the fis with this: fis = new InputStreamReader(new FileInputStream(f), "UTF-16"); -Original Message- From: Fred Toth [mailto:[EMAIL PROTECTED] Sent: Friday, September 24, 2004
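
A minimal sketch of that change, assuming the source files really are UTF-16 (substitute "UTF-8" or whatever encoding the files actually use; the helper class below is illustrative, not part of the demo):

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;

public class EncodingAwareReader {
    // Wrap the raw byte stream in a Reader with an explicit charset instead of
    // relying on the platform default encoding. The charset argument must match
    // how the files were actually written.
    public static Reader open(File f, String charset) throws IOException {
        return new InputStreamReader(new FileInputStream(f), charset);
    }
}

Usage in the demo would then be along the lines of: Reader fis = EncodingAwareReader.open(f, "UTF-16");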

TopTerms on query results

2004-09-22 Thread wallen
Can anyone help me with code to get the top terms of a given field for a query result set? Here is code modified from Luke to get the top terms for a field: public TermInfo[] mostCommonTerms( String fieldName, int numberOfTerms ) { //make sure min will get a positive number i
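
A sketch in the spirit of the Luke code mentioned above (the TermCount holder below is a stand-in for Luke's TermInfo, and the whole class is illustrative): walk the TermEnum for one field and keep the terms with the highest document frequency. Note this covers the whole index, not just a query result set.

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

public class TopTerms {
    // Simple holder; Luke's TermInfo plays the same role.
    public static class TermCount {
        public final Term term;
        public final int docFreq;
        TermCount(Term t, int df) { term = t; docFreq = df; }
    }

    // Most frequent terms of one field across the whole index.
    public static List mostCommonTerms(IndexReader reader, String fieldName, int numberOfTerms)
            throws Exception {
        List all = new ArrayList();
        TermEnum terms = reader.terms(new Term(fieldName, ""));
        try {
            // terms() positions the enum at the first term >= the given one,
            // so examine term() before calling next().
            do {
                Term t = terms.term();
                if (t == null || !t.field().equals(fieldName)) break;
                all.add(new TermCount(t, terms.docFreq()));
            } while (terms.next());
        } finally {
            terms.close();
        }
        Collections.sort(all, new Comparator() {
            public int compare(Object a, Object b) {
                return ((TermCount) b).docFreq - ((TermCount) a).docFreq;
            }
        });
        return all.subList(0, Math.min(numberOfTerms, all.size()));
    }
}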

getting most common terms for a smaller set of documents

2004-09-07 Thread wallen
Dear Lucene Users: What is the best way to get the most common terms for a subset of the total documents in your index? I know how to get the most common terms for a field for the entire index, but what is the most efficient way to do this for a subset of documents? Here is the code I am using t
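
One hedged approach for the subset case, assuming the field was indexed with term vectors (without them this returns nothing): pull the TermFreqVector for each document in the subset and accumulate the frequencies, instead of scanning the whole TermEnum.

import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;

public class SubsetTermCounts {
    // docIds is the subset of interest, e.g. document numbers collected from a Hits object.
    public static Map countTerms(IndexReader reader, int[] docIds, String field)
            throws Exception {
        Map counts = new HashMap();   // term text -> Integer total frequency
        for (int i = 0; i < docIds.length; i++) {
            TermFreqVector tfv = reader.getTermFreqVector(docIds[i], field);
            if (tfv == null) {
                continue;             // this document has no term vector for the field
            }
            String[] terms = tfv.getTerms();
            int[] freqs = tfv.getTermFrequencies();
            for (int j = 0; j < terms.length; j++) {
                Integer old = (Integer) counts.get(terms[j]);
                int total = (old == null ? 0 : old.intValue()) + freqs[j];
                counts.put(terms[j], new Integer(total));
            }
        }
        return counts;                // sort the entries by value to get the most common terms
    }
}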

RE: Spam:too many open files

2004-09-07 Thread wallen
A note to developers: the code checked into Lucene CVS ~Aug 15th, post 1.4.1, was causing frequent index corruptions. Since reverting back to version 1.4 I am no longer getting the corruptions. I was unable to trace the problem to anything specific, but was using the newer code to take advantage

RE: Spam:too many open files

2004-09-07 Thread wallen
I sent out an email to this list a few weeks ago about how to fix a corrupt index. I basically edited the segments file with a hex editor, removing the entry for the missing file and decrementing the file count that appears near the beginning of the segments file. -Orig

RE: Restoring a corrupt index

2004-08-17 Thread wallen
-George --- Honey George <[EMAIL PROTECTED]> wrote: > Wallen, > Which hex editor have you used. I am also facing a > similar problem. I tried to use KHexEdit and it > doesn't seem to help. I am attaching with this e

RE: Restoring a corrupt index

2004-08-17 Thread wallen
http://www.ultraedit.com/ is the best! However, I cannot imagine how another hex editor wouldn't work. -Original Message- From: Honey George [mailto:[EMAIL PROTECTED] Sent: Tuesday, August 17, 2004 10:35 AM To: Lucene Users List Subject: RE: Restoring a corrupt index Wallen, Which hex

RE: Restoring a corrupt index

2004-08-16 Thread wallen
iter.optimize(IndexWriter.java:366) at TryStuff.tryFixingLuceneIndex(TryStuff.java:60) at TryStuff.main(TryStuff.java:49) -Directory listing- -rw-rw-r-- 1 wallen devs 383461 Jul 27 16:48 _1wtg.cfs -rw-rw-r-- 1 wallen devs 754131765 Jul 27 21:12 _262q

Restoring a corrupt index

2004-08-16 Thread wallen
-rw-rw-r-- 1 wallen devs 383461 Jul 27 16:48 _1wtg.cfs -rw-rw-r-- 1 wallen devs 754131765 Jul 27 21:12 _262q.cfs -rw-rw-r-- 1 wallen devs 754345785 Jul 29 11:43 _4c49.cfs -rw-rw-r-- 1 wallen devs 719608798 Jul 31 04:38 _6i6l.cfs -rw-rw-r-- 1

RE: Finding All?

2004-08-13 Thread wallen
A ranged query that covers the full range does the same thing. Of course it is also inefficient because of the term generation: myField:[a TO z] -Original Message- From: Patrick Burleson [mailto:[EMAIL PROTECTED] Sent: Friday, August 13, 2004 3:58 PM To: Lucene Users List Subject: Re: Finding All
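
Spelled out, roughly (the field name is just an example, and terms that sort outside the a..z range, such as those starting with digits, are not matched):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class MatchAllByRange {
    // The range is rewritten into one clause per distinct term in the field,
    // which is where the term-generation cost mentioned above comes from.
    public static Query build() throws ParseException {
        return QueryParser.parse("myField:[a TO z]", "myField", new StandardAnalyzer());
    }
}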

RE: Question on the minimum value for DateField

2004-08-04 Thread wallen
The date is stored as a long that is the number of milliseconds since Jan 1, 1970. Anything before that would be negative. -Original Message- From: Terence Lai [mailto:[EMAIL PROTECTED] Sent: Wednesday, August 04, 2004 6:25 PM To: Lucene Users List Subject: Question on the minimum value for DateFi
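
A small sketch of the round trip through DateField (the encoding is derived from Date.getTime(), and pre-1970 dates, which would produce a negative value, are rejected; the field name is illustrative):

import java.util.Date;
import org.apache.lucene.document.DateField;
import org.apache.lucene.document.Field;

public class DateFieldExample {
    // Encode a date as a sortable keyword field.
    public static Field makeDateField(Date d) {
        String encoded = DateField.dateToString(d);   // throws for dates before 1970
        return Field.Keyword("modified", encoded);
    }

    // Decode a stored value back into a Date.
    public static Date decode(String encoded) {
        return DateField.stringToDate(encoded);
    }
}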

RE: TermFreqVector Beginner Question

2004-07-28 Thread wallen
Are you certain that you are storing the field "contents" in your documents, not just tokenizing it? If you use the overloaded method that takes a Reader, you lose the content. -Original Message- From: Grant Ingersoll [mailto:[EMAIL PROTECTED] Sent: Wednesday, July 28, 2004 5:35 PM To: [EMA
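
The distinction in question, sketched with the 1.4-era Field factory methods (the field name follows the thread; this is an illustration, not the poster's code):

import java.io.Reader;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class ContentsField {
    // Stored and tokenized: the original text can be retrieved at search time.
    public static void addStored(Document doc, String text) {
        doc.add(Field.Text("contents", text));
    }

    // Tokenized only: the Reader variant indexes the tokens but does not store
    // the value, so doc.get("contents") returns null for retrieved documents.
    public static void addUnstored(Document doc, Reader reader) {
        doc.add(Field.Text("contents", reader));
    }
}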

RE: Lucene vs. MySQL Full-Text

2004-07-22 Thread wallen
I also question whether it could handle extreme volume with such good query speed. Has anyone done numbers with 1+ million documents? -Original Message- From: Daniel Naber [mailto:[EMAIL PROTECTED] Sent: Tuesday, July 20, 2004 5:44 PM To: Lucene Users List Subject: Re: Lucene vs. MySQL F

RE: Very slow IndexReader.open() performance

2004-07-22 Thread wallen
It could also be that your disk space is filling up and the OS runs out of swap room. -Original Message- From: Mark Florence [mailto:[EMAIL PROTECTED] Sent: Tuesday, July 20, 2004 1:52 PM To: Lucene Users List Subject: Very slow IndexReader.open() performance Hi -- We have a large index

RE: Searching against Database

2004-07-15 Thread wallen
If you know ahead of time which documents are viewable by a certain user group you could add a field, such as group, and then when you index the document you put the names of the user groups that are allowed to view that document. Then your query tool can append, for example "AND group:developers"
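
A sketch of that pattern (field and group names are illustrative): index the allowed groups as keyword fields, then wrap the user's query in a BooleanQuery that also requires a match on the user's group.

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class GroupFilter {
    // At index time: record every group allowed to see the document.
    public static void addGroups(Document doc, String[] groups) {
        for (int i = 0; i < groups.length; i++) {
            doc.add(Field.Keyword("group", groups[i]));
        }
    }

    // At search time: require the user's group in addition to the user's query,
    // equivalent to appending "AND group:developers" to the query string.
    public static Query restrict(Query userQuery, String userGroup) {
        BooleanQuery q = new BooleanQuery();
        q.add(userQuery, true, false);                                    // required
        q.add(new TermQuery(new Term("group", userGroup)), true, false);  // required
        return q;
    }
}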

corrupt indexes?

2004-07-13 Thread wallen
Has anyone had any experience with their index getting corrupted? Are there any tools to repair it should it get corrupted? I have not had any problems, but was curious about how resilient this data store seems to be. -Will

RE: Field.java -> STORED, NOT_STORED, etc...

2004-07-12 Thread wallen
I have 2 suggestions: 1) use Eclipse, or an IDE that shows the javadoc on mouseover; 2) if you are going to create constants, consider using bit flags. Then your constants can have power-of-two values, i.e. STORED = 1, INDEXED = 2, TOKENIZED = 4. Then you can have the constructor look like: new Fiel
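
A generic sketch of the bit-flag suggestion (this is the idea being proposed in the thread, not an API Lucene actually exposes):

public class FieldFlags {
    // Powers of two, so flags can be OR-ed together and tested independently.
    public static final int STORED    = 1;
    public static final int INDEXED   = 2;
    public static final int TOKENIZED = 4;

    private final boolean stored;
    private final boolean indexed;
    private final boolean tokenized;

    // e.g. new FieldFlags(STORED | INDEXED | TOKENIZED)
    public FieldFlags(int flags) {
        stored    = (flags & STORED) != 0;
        indexed   = (flags & INDEXED) != 0;
        tokenized = (flags & TOKENIZED) != 0;
    }

    public boolean isStored()    { return stored; }
    public boolean isIndexed()   { return indexed; }
    public boolean isTokenized() { return tokenized; }
}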

RE: Problem with match on a non tokenized field.

2004-07-09 Thread wallen
I do not know how to work around that. It is indeed an interesting situation that would require more understanding of how the analyzer (in this case NullAnalyzer) interacts with special characters such as * and ~. You could try using the WhitespaceAnalyzer instead of the NullAnalyzer!

RE: Problem with match on a non tokenized field.

2004-07-08 Thread wallen
The PerFieldAnalyzerWrapper is constructed with your default analyzer; suppose this is the analyzer you use to tokenize. You then call the addAnalyzer method for each non-tokenized/keyword field. In the case below, url is a keyword and all other fields are tokenized: PerFieldAnalyzerWrapper analyz
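
Putting that together, roughly (StandardAnalyzer stands in for the poster's default analyzer, the field names are illustrative, and NullAnalyzer is the poster's own class; a sketch of one follows the next message):

import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class KeywordFieldQueryParsing {
    public static Query parse(String userQuery) throws ParseException {
        // The default analyzer handles the tokenized fields; "url" gets an
        // analyzer that leaves its value as a single untouched token.
        PerFieldAnalyzerWrapper analyzer =
                new PerFieldAnalyzerWrapper(new StandardAnalyzer());
        analyzer.addAnalyzer("url", new NullAnalyzer());
        return QueryParser.parse(userQuery, "contents", analyzer);
    }
}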

RE: Problem with match on a non tokenized field.

2004-07-07 Thread wallen
Use org.apache.lucene.analysis.PerFieldAnalyzerWrapper Here is how I use it: PerFieldAnalyzerWrapper analyzer = new org.apache.lucene.analysis.PerFieldAnalyzerWrapper(new MyAnalyzer()); analyzer.addAnalyzer("url", new NullAnalyzer()); try
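
NullAnalyzer is not a stock Lucene class; a minimal sketch of such an analyzer, which emits the entire field value as a single token so keyword fields survive query parsing unchanged, might look like this:

import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public class NullAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, final Reader reader) {
        return new TokenStream() {
            private boolean done = false;

            // Read the whole field value and return it as one Token.
            public Token next() throws IOException {
                if (done) {
                    return null;
                }
                done = true;
                StringBuffer sb = new StringBuffer();
                char[] buf = new char[256];
                int len;
                while ((len = reader.read(buf)) != -1) {
                    sb.append(buf, 0, len);
                }
                return new Token(sb.toString(), 0, sb.length());
            }
        };
    }
}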

QueryParser and Keyword Fields

2004-06-25 Thread wallen
Can anyone give me advice on the best way to not have your keyword fields analyzed by QueryParser? Even though it seems like it would be a common problem, I have read the FAQ, and found this relevant thread with no real answers. http://issues.apache.org/eyebrowse/[EMAIL PROTECTED] he.org&msgId=12

RE: Demo 3 on windows

2004-06-22 Thread wallen
Use forward slashes (/) instead of backslashes (\) in your path: c:/apache/group/index, or, if c: is your main drive, /apache/group/index. -Original Message- From: Hetan Shah [mailto:[EMAIL PROTECTED] Sent: Monday, June 21, 2004 5:55 PM To: [EMAIL PROTECTED] Subject: Demo 3 on windows Hello, I have bee

RE: search "" and ""

2004-06-18 Thread wallen
This depends on the analyzer you use. http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.indexing&toc=faq#q13 -Original Message- From: Lynn Li [mailto:[EMAIL PROTECTED] Sent: Friday, June 18, 2004 5:03 PM To: '[EMAIL PROTECTED]' Subject: search "" and "" When search

RE: help needed in starting lucene

2004-06-02 Thread wallen
It sounds to me like you need a newer version of Java. -Original Message- From: milind honrao [mailto:[EMAIL PROTECTED] Sent: Wednesday, June 02, 2004 5:36 PM To: [EMAIL PROTECTED] Subject: help needed in starting lucene Hi, I am just a beginner. I installed lucene according to the int

RE: Problem Indexing Large Document Field

2004-05-26 Thread wallen
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexWriter.html#DEFAULT_MAX_FIELD_LENGTH maxFieldLength (public int maxFieldLength): The maximum number of terms that will be indexed for a single field in a document. This limits the amount of memory required for indexing, so that co
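
To raise the limit, the field is simply set on the IndexWriter instance before adding documents (the path, analyzer, and new limit below are illustrative):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class BigFieldIndexer {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), true);
        // Default is 10,000 terms per field; terms past the limit are silently dropped.
        // Raising it trades indexing memory for completeness.
        writer.maxFieldLength = 1000000;
        // ... addDocument() calls ...
        writer.close();
    }
}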

RE: Memory usage

2004-05-26 Thread wallen
This sounds like a memory leak situation. If you are using Tomcat I would suggest you make sure you are on a recent version, as version 4 is known to have some memory leaks. It doesn't make sense that repeated queries would use more memory than the most demanding query unless objects are

RE: Performance profile of optimization...

2004-05-24 Thread wallen
My understanding is that hard drive IO is the main bottleneck, as the operation is mainly a file copy. So to directly answer your question, I believe the overall file size of your indexes will linearly affect the performance profile of your optimizations. -Original Message- From: Michael

RE: Rebuild after corruption

2004-05-21 Thread wallen
Make sure you close your IndexWriter. http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexWriter.html#close() -Original Message- From: Steve Rajavuori [mailto:[EMAIL PROTECTED] Sent: Friday, May 21, 2004 7:49 PM To: '[EMAIL PROTECTED]' Subject: Rebuild after corruption
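
The usual guard, sketched: close the writer in a finally block so the index is flushed and the write lock released even when indexing fails part-way.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class SafeIndexing {
    public static void index(String indexDir) throws Exception {
        IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), false);
        try {
            // ... addDocument() calls ...
        } finally {
            writer.close();   // always close, even if indexing throws
        }
    }
}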

RE: Searching Microsoft Word , Excel and PPT files for Japanese

2004-05-20 Thread wallen
I am not sure. See what Google gives you. I would guess you need to get a table of entities and compare it to the Unicode character. So if you parse the Word file you might see something like "&#12312;" (without quotes); this corresponds to a single Unicode character and you can use the Java API t
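
If the parser does hand back numeric character references, converting one to the corresponding character is a one-liner (BMP characters only; the class and method below are illustrative):

public class EntityDecoder {
    // Convert a numeric character reference such as "&#12312;" to the
    // corresponding Unicode character.
    public static String decode(String entity) {
        // strip the leading "&#" and the trailing ";"
        int codePoint = Integer.parseInt(entity.substring(2, entity.length() - 1));
        return String.valueOf((char) codePoint);
    }
}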

RE: Searching Microsoft Word , Excel and PPT files for Japanese

2004-05-20 Thread wallen
I believe MS apps store non-ASCII characters as entities internally instead of using Unicode. You can see evidence of this if you save your file as an HTML file and look at the source. You will have to adjust your parser to convert the Windows-1252 characters/entities to Unicode (UTF-8 or UTF-16)

RE: AW: Problem indexing Spanish Characters

2004-05-19 Thread wallen
Here is an example method in org.apache.lucene.demo.html.HTMLParser that uses a different buffered reader for a different encoding. public Reader getReader() throws IOException { if (pipeIn == null) { pipeInStream = new MyPip

Can documents be appended to?

2004-05-17 Thread wallen
Is it possible to append to an existing document? Judging by my own tests and this thread, NO. http://issues.apache.org/eyebrowse/[EMAIL PROTECTED] he.org&msgNo=3971 Wouldn't it be possible to look up an individual document (based upon a uid of sorts), then load the Fields off of the old one, del
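
A sketch of that delete-and-re-add workaround, assuming every document carries a unique "uid" keyword field and that all of its original fields were stored (unstored fields cannot be recovered this way, and the copied fields are rebuilt from stored values, so indexing options may not match the originals exactly):

import java.util.Enumeration;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class AppendToDocument {
    public static void append(String indexDir, String uid, Field newField) throws Exception {
        // 1. Look up the existing document by its unique id.
        IndexSearcher searcher = new IndexSearcher(indexDir);
        Hits hits = searcher.search(new TermQuery(new Term("uid", uid)));
        if (hits.length() == 0) {
            searcher.close();
            return;
        }
        Document old = hits.doc(0);
        searcher.close();

        // 2. Delete the old copy.
        IndexReader reader = IndexReader.open(indexDir);
        reader.delete(new Term("uid", uid));
        reader.close();

        // 3. Re-add the stored fields plus the new one.
        Document fresh = new Document();
        for (Enumeration e = old.fields(); e.hasMoreElements();) {
            fresh.add((Field) e.nextElement());
        }
        fresh.add(newField);
        IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), false);
        writer.addDocument(fresh);
        writer.close();
    }
}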