Re: Multiple indexes

2005-03-01 Thread Otis Gospodnetic
Ben,

You do need to use a separate instance of those 3 classes for each
index, yes.  But this is really just something like:

IndexWriter writer = new IndexWriter();

So it's the normal code-writing process: you don't really have to create
anything new, just use the existing Lucene API.  As for locking, again you
don't need to create anything.  Lucene does have a locking mechanism,
but most of it should be completely invisible to you if you follow the
concurrency rules.
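
In other words, something like this (just a rough sketch; the paths and
the articleDoc/productDoc variables are made up):

Analyzer analyzer = new StandardAnalyzer();

// one set of index objects per index
IndexWriter articleWriter = new IndexWriter("/indexes/articles", analyzer, true);
IndexWriter productWriter = new IndexWriter("/indexes/products", analyzer, true);

articleWriter.addDocument(articleDoc);   // a Document for the first index
productWriter.addDocument(productDoc);   // a Document for the second index

articleWriter.close();
productWriter.close();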

I hope this helps.

Otis

--- Ben [EMAIL PROTECTED] wrote:

 Is it true that for each index I have to create a separate instance
 of FSDirectory, IndexWriter and IndexReader? Do I need to create a
 separate locking mechanism as well?
 
 I have already implemented a program using just one index.
 
 Thanks,
 Ben
 
 On Tue, 1 Mar 2005 22:09:05 -0500, Erik Hatcher
 [EMAIL PROTECTED] wrote:
  It's hard to answer such a general question with anything very
 precise,
  so sorry if this doesn't hit the mark.  Come back with more details
 and
  we'll gladly assist though.
  
  First, certainly do not copy/paste code.  Use standard reuse
 practices,
  perhaps the same program can build the two different indexes if
 passed
  different parameters, or share code between two different programs
 as a
  JAR.
  
  What specifically are the issues you're encountering?
  
  Erik
  
  
  On Mar 1, 2005, at 8:06 PM, Ben wrote:
  
   Hi
  
   My site has two types of documents with different structure. I
 would
   like to create an index for each type of document. What is the
 best
   way to implement this?
  
   I have been trying to implement this but found out that 90% of
 the
   code is the same.
  
   In Lucene in Action book, there is a case study on jGuru, it just
   mentions them using multiple indexes. I would like to do
 something
   like them.
  
   Any resources on the Internet that I can learn from?
  
   Thanks,
   Ben
  
  



Re: Ranking Terms

2005-02-26 Thread Otis Gospodnetic
Make sure you are not indexing your documents using the compound index
format (default in the newer versions of Lucene).  Then you will see
the .frq file.  Here is an example from one of Simpy's Lucene indices:

-rw-r--r--1 simpysimpy  629073 Feb 26 13:14 _1ao.frq
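
If you want the separate files instead of the compound .cfs, something
like this should do it (a sketch, 1.4-style API):

IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), true);
writer.setUseCompoundFile(false);   // write .frq, .prx, .tis, etc. instead of one .cfs
// ... add documents ...
writer.close();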

Otis
--
http://www.simpy.com

--- Daniel Cortes [EMAIL PROTECTED] wrote:

 Hi everybody,
 I need to find some documentation about the algorithms that lucene
 uses internally for indexing and how it works with weights and
 frequencies of the terms. This information will be used to learn the
 tastes of my users and to relate users with the same interests and
 restlessness. :D
 I read something about .frq files but I don't have any .frq file in my
 index.
 Thanks.
 



Re: Lucene vs. in-DB-full-text-searching

2005-02-18 Thread Otis Gospodnetic
The most obvious answer is that the full-text indexing features of
RDBMS's are not as good (as fast) as Lucene.  MySQL, PostgreSQL,
Oracle, MS SQL Server etc. all have full-text indexing/searching
features, but I always hear people complaining about the speed.  A
person from a well-known online bookseller told me recently that Lucene
was about 10x faster than MySQL for full-text searching, and I am
currently helping someone get away from MySQL and into Lucene for
performance reasons.

Otis




--- Steven J. Owens [EMAIL PROTECTED] wrote:

 Hi,
 
  I was rambling to some friends about an idea to build a
 cache-aware JDBC driver wrapper, to make it easier to keep a lucene
 index of a database up to date.
 
  They asked me a question that I have to take seriously, which is
 that most RDBMSes provide some built-in fulltext searching -
 postgres,
 mysql, even oracle - why not use that instead of adding another layer
 of caching?
 
  I have to take this question seriously, especially since it
 reminds me a lot of what Doug has often said to folks contemplating
 doing similar things (caching query results, etc) with Lucene.
 
  Has anybody done some serious investigation into this, and could
 summarize the pros and cons?
 
 -- 
 Steven J. Owens
 [EMAIL PROTECTED]
 
 I'm going to make broad, sweeping generalizations and strong,
  declarative statements, because otherwise I'll be here all night and
  this document will be four times longer and much less fun to read.
  Take it all with a grain of salt. - http://darksleep.com/notablog
 
 



Re: Search Performance

2005-02-18 Thread Otis Gospodnetic
Or you could just open a new IndexSearcher, forget the old one, and
have GC collect it when everyone is done with it.
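
A minimal sketch of that idea (assuming one class hands out the current
searcher):

private IndexSearcher current;

public synchronized IndexSearcher getSearcher() {
    return current;
}

public synchronized void reopen(String indexPath) throws IOException {
    current = new IndexSearcher(indexPath);
    // the old searcher is not closed here; GC reclaims it once the last
    // thread holding its Hits drops the reference
}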

Otis

--- Chris Lamprecht [EMAIL PROTECTED] wrote:

 I should have mentioned, the reason for not doing this the obvious,
 simple way (just close the Searcher and reopen it if a new version is
 available) is because some threads could be in the middle of
 iterating
 through the search Hits.  If you close the Searcher they get a Bad
 file descriptor IOException.  As I found out the hard way :)
 
 
 On Fri, 18 Feb 2005 15:03:29 -0600, Chris Lamprecht
 [EMAIL PROTECTED] wrote:
  I recently dealt with the issue of re-using a Searcher with an
 index
  that changes often.  I wrote a class that allows my searching
 classes
  to check out a lucene Searcher, perform a search, and then return
  the Searcher.  It's similar to a database connection pool, except
 that
 



Re: Document comparison

2005-02-18 Thread Otis Gospodnetic
Matt,

Erik and I have some code for this in Lucene in Action, but David
Spencer did this since the book was published:

  http://www.lucenebook.com/blog/announcements/more_like_this.html
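
The sandbox MoreLikeThis class can be used roughly like this (a sketch
from memory, so double-check the method names against the class itself):

IndexReader reader = IndexReader.open("/path/to/index");
MoreLikeThis mlt = new MoreLikeThis(reader);
mlt.setFieldNames(new String[] { "contents" });  // field(s) to mine for terms
Query like = mlt.like(42);                       // 42 = doc number of the source document
Hits hits = new IndexSearcher(reader).search(like);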

Otis

--- Matt Chaput [EMAIL PROTECTED] wrote:

 Is there a simple, efficient way to compute similarity of documents 
 indexed with Lucene?
 
 My first, naive idea is to use the entire contents of one document as
 a 
 query to the second document, and use the score as a similarity 
 measurement. But I think I'm probably way off base with that.
 
 Can any IR pros set me straight? Thanks very much.
 
 Matt
 
 
 --
 Matt Chaput
 Word Monkey
 Side Effects Software Inc.
 
 A goddamned ray of sunshine all the goddamned time
 -- Sparkle Hayter
 
 



Re: Search Performance

2005-02-18 Thread Otis Gospodnetic
Yes, until it's cleaned up: as soon as the last client is done with
Hits, the originating IndexSearcher is ready for cleanup if nobody else
is holding references to it.  You can also close it explicitly, as you
are doing; no harm in that.

Otis

--- Chris Lamprecht [EMAIL PROTECTED] wrote:

 Wouldn't this leave open file handles?   I had a problem where there
 were lots of open file handles for deleted index files, because the
 old searchers were not being closed.
 
 On Fri, 18 Feb 2005 13:41:37 -0800 (PST), Otis Gospodnetic
 [EMAIL PROTECTED] wrote:
  Or you could just open a new IndexSearcher, forget the old one, and
  have GC collect it when everyone is done with it.
  
  Otis
 
 



Re: Concurrent searching re-indexing

2005-02-16 Thread Otis Gospodnetic
Hi Paul,

If I understand your setup correctly, it looks like you are running
multiple threads that create an IndexWriter for the same directory.
That's a no-no.

This section (first hit) describes the various concurrency issues with
regard to adds, updates, optimization, and searches:
  http://www.lucenebook.com/search?query=concurrent

IndexSearcher (IndexReader, really) does take a snapshot of the index
state when it is opened, so at that time the index segments listed in
the segments file should be in a complete state.  It also reads index files when
searching, of course.
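
In other words, don't open a writer per thread; share one (a sketch):

// a single writer shared by all indexing threads; addDocument() is safe
// to call from multiple threads, but never open two IndexWriters on the
// same directory at the same time
IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), true);
// ... threads call writer.addDocument(doc) ...
writer.close();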

Otis


--- Paul Mellor [EMAIL PROTECTED] wrote:

 Hi,
 
 I've read from various sources on the Internet that it is perfectly
 safe to
 simultaneously search a Lucene index that is being updated from
 another
 Thread, as long as all write access to the index is synchronized. 
 But does
 this apply only to updating the index (i.e. deleting and adding
 documents),
 or to a complete re-indexing (i.e. create a new IndexWriter with the
 'create' argument true and then re-add all the documents)?
 
 I have a class which encapsulates all access to my index, so that
 writes can
 be synchronized.  This class also exposes a method to obtain an
 IndexSearcher for the index.  I'm running unit tests to test this
 which
 create many threads - each thread does a complete re-indexing and
 then
 obtains an IndexSearcher and does a query.
 
 I'm finding that with sufficiently high numbers of threads, I'm
 getting the
 occasional failure, with the following exception thrown when
 attempting to
 construct a new IndexWriter (during the reindexing) -
 
 java.io.IOException: couldn't delete _a.f1
 at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:166)
 at

org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:135)
 at

org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:113)
 at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:151)
 ...
 
 The exception occurs quite infrequently (usually for somewhere
 between 1-5%
 of the Threads).
 
 Does the IndexSearcher take a 'snapshot' of the index at creation? 
 Or does
 it access the filesystem whilst searching?  I am also synchronizing
 creation
 of the IndexSearcher with the write lock, so that the IndexSearcher
 is not
 created whilst the index is being recreated (and vice versa).  But do
 I need
 to ensure that the IndexSearcher cannot search whilst the index is
 being
 recreated as well?
 
 Note that a similar unit test where the threads update the index
 (rather
 than recreate it from scratch) works fine, as expected.
 
 This is running on Windows 2000.
 
 Any help would be much appreciated!
 
 Paul
 
 This e-mail and any files transmitted with it are confidential and
 intended
 solely for the use of the individual or entity to whom they are
 addressed.
 If you are not the intended recipient, you should not copy,
 retransmit or
 use the e-mail and/or files transmitted with it  and should not
 disclose
 their contents. In such a case, please notify
 [EMAIL PROTECTED]
 and delete the message from your own system. Any opinions expressed
 in this
 e-mail and/or files transmitted with it that do not relate to the
 official
 business of this company are those solely of the author and should
 not be
 interpreted as being endorsed by this company.
 





Re: What does [] do to a query and what's up with lucene.apache.org?

2005-02-14 Thread Otis Gospodnetic
Hi,

lucene.apache.org seems to work now.
Here is the query syntax:
  http://lucene.apache.org/queryparsersyntax.html
[] is used as [BEGIN-RANGE-STRING TO END-RANGE-STRING]
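
For example (field names made up):

  date:[20040101 TO 20041231]    -- inclusive range on the date field
  title:{Aida TO Carmen}         -- exclusive range

So QueryParser expects something of the form [x TO y], which is why
[this is a test] fails to parse.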

Otis



--- Jim Lynch [EMAIL PROTECTED] wrote:

 First I'm getting a
 
     The requested URL could not be retrieved
 
     While trying to retrieve the URL:
     http://lucene.apache.org/src/test/org/apache/lucene/queryParser/TestQueryParser.java
 
     The following error was encountered:
 
         Unable to determine IP address from host name for lucene.apache.org
 
 Guess the system is down.
 
 I'm getting this error:
 
 org.apache.lucene.queryParser.ParseException: Encountered "is" at
 line 1, column 15.
 Was expecting:
     "]" ...
 when I tried to parse the following string "[this is a test]".
 
 I can't find any documentation that tells me what the brackets do to
 a 
 query.  I had a user that was used to another search engine that used
 [] 
 to do proximity or near searches and tried it on this one. Actually
 I'd 
 like to see the documentation for what the parser does.  All that is 
 mentioned in the javadoc is + - and ().  Obviously there are more 
 special characters.
 
 Thanks,
 Jim.
 
 
 



Re: behavioral differences between Field.Keyword and Field.UnStored

2005-02-11 Thread Otis Gospodnetic
The QueryParser is analyzing your Field.Keyword (genre field) fields,
because it doesn't know that genre is a Keyword field and should not be
analyzed.

Check section 4.4. here:
  http://www.lucenebook.com/search?query=queryparser+keyword
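
If you don't want to change the analysis, you can also query the
un-analyzed field directly with a TermQuery instead of going through
QueryParser, e.g. (a sketch using your field):

Query genreQuery = new TermQuery(new Term("genre", "Punk"));  // exact, case-sensitive match
Hits hits = searcher.search(genreQuery);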

Otis
 

--- Mike Rose [EMAIL PROTECTED] wrote:

 Perhaps someone can explain something that seems to be a little weird
 to
 me.  I seem to be unable to search on fields of type Keyword.  The
 following snippet returns no hits..
 
  
 
 IndexWriter index = new IndexWriter(indexPath, new StandardAnalyzer(), true);
 
 Document doc = null;
 
 doc = new Document();
 doc.add(Field.Text("artist", "Butthole Surfers"));
 doc.add(Field.Keyword("genre", "Punk"));
 doc.add(Field.Text("album", "Rembrandt Pussyhorse"));
 index.addDocument(doc);
 
 doc = new Document();
 doc.add(Field.Text("artist", "Ornette Coleman"));
 doc.add(Field.Keyword("genre", "Jazz"));
 doc.add(Field.Text("album", "Tomorrow is the Question"));
 index.addDocument(doc);
 
 index.optimize();
 index.close();
 
 Searcher searcher = new IndexSearcher(indexPath);
 
 String expression = "genre:punk";
 Query query = QueryParser.parse(expression, "artist", new StandardAnalyzer());
 
 Hits hits = searcher.search(query);
 for (int i = 0; i < hits.length(); i++) {
     System.out.println(hits.doc(i));
 }
 
 searcher.close();
 
  
 
  
 
 However, if I change the genre field to be defined as Field.Text or
 Field.UnStored, I get the result I expect.  Can anyone offer any
 insight?
 
  
 
 Mike
 
  
 
  
 
  
 
  
 
  
 
 







Re: Optimize not deleting all files

2005-02-04 Thread Otis Gospodnetic
Get and try Lucene 1.4.3.  One of the older versions had a bug that
left old index files undeleted.

Otis

--- [EMAIL PROTECTED] wrote:

 Hi,
 
 When I run an optimize in our production environment, old index files are
 left in the directory and are not deleted.
 
 My understanding is that an
 optimize will create new index files and all existing index files
 should be
 deleted.  Is this correct?
 
 We are running Lucene 1.4.2 on Windows.  
 
 
 Any help is appreciated.  Thanks!
 



Re: Numbers in the Query String

2005-02-03 Thread Otis Gospodnetic
Using different analyzers for indexing and searching is not
recommended.
Your numbers are not even in the index because you are using
StandardAnalyzer.  Use Luke to look at your index.
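
The safest pattern is to use the same analyzer on both sides, e.g. (a
sketch; the field name is made up):

Analyzer analyzer = new StandardAnalyzer();

IndexWriter writer = new IndexWriter("/path/to/index", analyzer, true);
// ... index your documents ...

Query q = QueryParser.parse("Java 2 Platform J2EE", "contents", analyzer);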

Otis


--- Hetan Shah [EMAIL PROTECTED] wrote:

 Hello,
 
 How can one search for a document based on a query which has
 numbers in the query string?
 
 e.g. query = "Java 2 Platform J2EE"
 
 What do I need to do so that the numbers do not get ignored?
 
 I am using StandardAnalyzer to index the pages and using StopAnalyzer
 to 
 search the documents. Would the use of two different analyzers cause
 any 
 trouble for the results?
 
 Thanks.
 -H
 
 



Re: which HTML parser is better?

2005-02-02 Thread Otis Gospodnetic
If you are not married to Java:
http://search.cpan.org/~kilinrax/HTML-Strip-1.04/Strip.pm

Otis

--- sergiu gordea [EMAIL PROTECTED] wrote:

 Karl Koch wrote:
 
 I am in control of the html, which means it is well-formatted HTML. I
 use only HTML files which I have transformed from XML. No external HTML
 (e.g. the web).
 
 Are there any very-short solutions for that?
   
 
 if you are using only correctly formatted HTML pages and you are in
 control of these pages, you can use a regular expression to remove the
 tags, something like
 
 replaceAll("<[^>]*>", "");
 
 This is the idea behind the operation. If you search on google you
 will find a more robust regular expression.
 
 Using a simple regular expression will be a very cheap solution that
 can cause you a lot of problems in the future.
  
  It's up to you to use it 
 
  Best,
  
  Sergiu
 
 Karl
 
   
 
 Karl Koch wrote:
 
 
 
 Hi,
 
 yes, but the library you are using is quite big. I was thinking that a
 5kB code could actually do that. That sourceforge project is doing
 much more than that but I do not need it.
  
 
   
 
 you need just the htmlparser.jar, 200k.
 ... you know ... the functionality is strongly correlated with the size.
 
 You can use 3 lines of code with a good regular expression to eliminate
 the html tags, but this won't give you any guarantee that the text from
 badly formatted html files will be correctly extracted...
 
   Best,
 
   Sergiu
 
 
 
 Karl
 
  
 
   
 
  Hi Karl,
 
 I already submitted a piece of code that removes the html tags.
 Search for my previous answer in this thread.
 
  Best,
 
   Sergiu
 
 Karl Koch wrote:
 
 Hello,
 
 I have been following this thread and have another question.
 
 Is there a piece of sourcecode (which is preferably very short and
 simple (KISS)) which allows to remove all HTML tags from HTML content?
 HTML 3.2 would be enough... also no frames, CSS, etc.
 
 I do not need to have the HTML structure tree or any other structure,
 but need a facility to clean up HTML into its normal underlying content
 before indexing that content as a whole.
 
 Karl
 
 
 
 
  
 
   
 
 I think that depends on what you want to do.  The Lucene demo parser
 does simple mapping of HTML files into Lucene Documents; it does not
 give you a parse tree for the HTML doc.  CyberNeko is an extension of
 Xerces (uses the same API; will likely become part of Xerces), and so
 maps an HTML document into a full DOM that you can manipulate easily
 for a wide range of purposes.  I haven't used JTidy at an API level and
 so don't know it as well -- based on its UI, it appears to be focused
 primarily on HTML validation and error detection/correction.
 
 I use CyberNeko for a range of operations on HTML documents that go
 beyond indexing them in Lucene, and really like it.  It has been robust
 for me so far.
 
 Chuck
 
 
 
 -Original Message-
 From: Jingkang Zhang [mailto:[EMAIL PROTECTED]
 Sent: Tuesday, February 01, 2005 1:15 AM
 To: lucene-user@jakarta.apache.org
 Subject: which HTML parser is better?
 
 Three HTML parsers (Lucene web application demo, CyberNeko HTML Parser,
 JTidy) are mentioned in Lucene FAQ 1.3.27. Which is the best? Can it
 filter tags that are auto-created by the MS-Word 'Save As HTML' function?
 
 
 
   
 

 
 
 


RE: carrot2 question too - Re: Fun with the Wikipedia

2005-01-31 Thread Otis Gospodnetic
Adam,

Dawid posted some code that lets you use Carrot2 locally with Lucene,
without the componentized pipeline system described on the Carrot2 site.

Otis

--- Adam Saltiel [EMAIL PROTECTED] wrote:

 David, Hi,
 Would you be able to comment on the coincidentally recent thread "RE:
 Grouping Search Results by Clustering Snippets"?
 Also, when I looked at Carrot2 the pipeline is implemented over http. I
 wonder how efficient that is, or can it be changed, for instance for
 an all-local implementation?
 Has Carrot2 been integrated with Lucene, and has it been used as the
 basis for a recommender system (could it be?)?
 TIA.
 
 Adam
 
  -Original Message-
  From: Dawid Weiss [mailto:[EMAIL PROTECTED]
  Sent: Monday, January 31, 2005 4:12 PM
  To: Lucene Users List
  Subject: Re: carrot2 question too - Re: Fun with the Wikipedia
 
 
  Hi.
 
  Coming up with answers... a little belated, but hope you're still
 on:
 
   we have been experimenting with carrot2 and are very pleased so
 far,
   only one issue: there is no release not even an alpha one and the
   dependencies seemed to be patched (jama)
 
  Yes, there is no official release. We just don't feel the need
 to tag
  the sources with an official label because Carrot is not a
 stand-alone
  product (rather a library... or a framework). It does not imply
 that the
  project is in alpha stage... quite the contrary, in fact -- it has
 been
  out there for a while and it seems to do a good job for most
 people.
 
   is there any intentions to have any releases in the near future?
 
  I could tag a release even today if it makes you happy ;) But I
 hope I
  made the status of the project clear above.
 
  D.
 
 



Re: total number of (unique) terms in the index

2005-01-28 Thread Otis Gospodnetic
I don't think there is a direct way to get the number of (unique) terms
in the index, so yes, I think you'll have to loop through TermEnum and
count.
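
Roughly something like this (untested sketch, 1.4-style API):

IndexReader reader = IndexReader.open("/path/to/index");
TermEnum terms = reader.terms();
int count = 0;
while (terms.next()) {
    count++;
}
terms.close();
reader.close();
System.out.println("unique terms: " + count);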

Otis

--- Jonathan Lasko [EMAIL PROTECTED] wrote:

 I'm looking for the total number of unique terms in the index.  I see
 
 that I can get a TermEnum of all the terms in the index, but what is
 the 
 fastest way to get the total number of terms?
 
 Jonathan
 



Re: Loading a large index

2005-01-28 Thread Otis Gospodnetic
Edwin,

--- Edwin Tang [EMAIL PROTECTED] wrote:

 I have three indices really that I search via ParallelMultiSearcher.
 All three
 are being updated constantly. We would like to be able to perform a
 search on
 the indices and have the results reflect the latest documents
 indexed. However,
 that would mean I need to refresh my searcher. Because of the size
 of these,
 it's taking some time to load, and so search speed from the end user
 perspective seems slow. What can I do to minimize or do away with the
 time it
 takes to loading a new searcher... from the end user perspective that
 is?

How up-to-date do these searches have to be?  If they don't have to be
exactly up to date you could periodically re-create the IndexSearcher,
instead of checking for a new index version on every search.  I think a
person from Moreover.com posted some code that may be relevant.  Maybe
3-4 months ago, maybe 6...it had to do with re-reading the index in the
background for sorting purposes, if I recall correctly.

Otis





Re: Disk space used by optimize

2005-01-28 Thread Otis Gospodnetic
Morus,

that description of 3 sets of index files is what I was imagining, too.
 I'll have to test and add to the book errata, it seems.

Thanks for the info,
Otis

--- Morus Walter [EMAIL PROTECTED] wrote:

 Otis Gospodnetic writes:
  Hello,
  
  Yes, that is how optimize works - copies all existing index
 segments
  into one unified index segment, thus optimizing it.
  
  see hit #1:
 http://www.lucenebook.com/search?query=optimize+disk+space
  
  However, three times the space sounds a bit too much, or I make a
  mistake in the book. :)
  
 I cannot explain why, but ~ three times the size of the final index
 is
 what I observed, when I logged disk usage during optimize of an index
 in compound index format.
 The test was on linux, I simply did a 'du -s' every few seconds
 parallel 
 to the optimize.
 I didn't test the non-compound format. Probably optimizing a compound format
 requires storing the different parts of the compound file separately
 before joining them into the compound file (sounds reasonable, otherwise
 you would need to know the sizes before creating the parts). In that case
 you had the original index, the separate files and the new compound file
 as the disk usage peak.
 
 So IMHO the book is wrong.
 
 Morus
 



Re: Lucene in Action hits desk in UK

2005-01-28 Thread Otis Gospodnetic
Hello,

I've asked the publisher ( http://www.manning.com ) yesterday.  I don't
know about the exact stores, but apparently they do have a distributor
in Singapore, so you should be able to find Lucene in Action there
soon.

Otis

--- jac jac [EMAIL PROTECTED] wrote:

 
 Just wondering:
 
 Is Lucene-in-Action being sold anywhere in Singapore?
 
  
 
 thanks!
 
 
 
 Otis Gospodnetic [EMAIL PROTECTED] wrote: Gospodnetić
 sounds like Gospodnetich and Eric is Erik :)
 
 Otis
 
 --- John Haxby wrote:
 
  Otis Gospodnetic wrote:
  
  I contacted both the US and UK Amazon sites and asked them to fix
 my
  last name (the last character in my name has a little slash (not
 an
  accent) above it), but they never bothered to fix it nor email me
  back.
  
   
  
  They probably don't know how to type a ć. How _do_ you pronounce
  your 
  name? I've no idea what to do with that mark over the final c! At
  the 
  moment it's Lucene in Action by Eric Hatcher and Otis 
  Gospo-something-or-other :-)
  
  Anyhow, enjoy Lucene in Action!
   
  
  Already doing so!
  
  jch






Re: Different Documents (with fields) in one index?

2005-01-27 Thread Otis Gospodnetic
Karl,

This is completely fine.  You can have documents with different fields
in the same index.
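
For example (a minimal sketch; the field names are made up and writer is
an existing IndexWriter):

Document book = new Document();
book.add(Field.Text("title", "Lucene in Action"));
book.add(Field.Keyword("isbn", "1932394281"));

Document memo = new Document();
memo.add(Field.Text("subject", "Meeting notes"));
memo.add(Field.Text("body", "..."));

writer.addDocument(book);   // same index,
writer.addDocument(memo);   // different fields - perfectly fine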

Otis

--- Karl Koch [EMAIL PROTECTED] wrote:

 Hello all,
 
 perhaps not such a sophisticated question: 
 
 I would like to have a very diverse set of documents in one index.
 Depending on the content of the text documents, I would like to put
 part of the text in
 different fields. This means in the searches, when searching a
 particular
 field, some of those documents won't be addressed at all.
 
 Is it possible to have different kinds of Documents with different
 index
 fields in ONE index? Or do I need one index for each set?
 
 Karl
 
 



Re: Boosting Questions

2005-01-27 Thread Otis Gospodnetic
Luke,

Boosting is only one of the factors involved in Document/Query scoring.
 Assuming that applying your boosts to Document A or to a single field
of Document A increases the total score enough, yes, that Document A
may have the highest score.  But just because you boost a single
Document and not others, it does not mean it will emerge at the top.
You should check out the Explanation class, which can dump all scoring
factors in text or HTML format.
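
For example, something along these lines (an untested sketch; doc,
query, hits and searcher are your own objects):

// boost a whole document, or just one of its fields, at indexing time
doc.setBoost(2.0f);
Field address = Field.Text("address", "123 Main St.");
address.setBoost(2.0f);
doc.add(address);

// after searching, ask the searcher why a hit scored the way it did
Explanation explanation = searcher.explain(query, hits.id(0));
System.out.println(explanation.toString());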

Otis


--- Luke Shannon [EMAIL PROTECTED] wrote:

 Hi All;
 
 I just want to make sure I have the right idea about boosting.
 
 So if I boost a document (Document A) after I index it (let's say a
 score of
 2.0) Lucene will now consider this document relatively more important
 than
 other documents in the index with a boost factor less than 2.0. This
 boost
 factor will also be applied to all the fields in the Document A.
 Therefore,
 if I do a TermQuery on a field that all my documents share (title),
 in the
 returned Hits (assuming Document A was among the return documents),
 Document
 A will score higher than other documents with a lower boost factor
 because
 the title field in A would have been boosted with all its other
 fields.
 Correct?
 
 Now if at indexing time I decided to boost a particular field, lets
 say
 address in Document A (this is a field which all documents have)
 the boost
 factor is only applied to the address field of Document A. Nothing
 else is
 boosted by this operation. This means if a TermQuery on the address
 field
 returns Document A along with a collection of other documents,
 Document A
 will score higher than the others because of boosting. Correct?
 
 Thanks,
 
 Luke
 
 
 



Re: XML index

2005-01-27 Thread Otis Gospodnetic
Hello Karl,

Grab the source code for Lucene in Action, it's got code that parses
and indexes XML with DOM and SAX.  You can see the coverage of that
stuff here: 
http://lucenebook.com/search?query=indexing+XML+section%3A7*
I haven't used kXML, but I imagine the LIA code should get you going
quickly and you are free to adapt the code to work with kXML for you.
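
Just to illustrate the idea (a bare-bones SAX sketch, not the LIA code
and not kXML; element names become field names):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class ElementToFieldHandler extends DefaultHandler {
    private final Document doc = new Document();
    private final StringBuffer text = new StringBuffer();

    public void startElement(String uri, String local, String qName, Attributes atts) {
        text.setLength(0);                             // start collecting this element's text
    }

    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    public void endElement(String uri, String local, String qName) {
        doc.add(Field.Text(qName, text.toString()));   // element name -> field name
    }

    public Document getDocument() {
        return doc;
    }
}

You would run it through a SAXParser and then call
writer.addDocument(handler.getDocument()).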

Otis

--- Karl Koch [EMAIL PROTECTED] wrote:

 Hi,
 
 I want to use kXML with Lucene to index XML files. I think it is
 possible to
 dynamically assign Node names as Document fields and Node texts as
 Text
 (after using an Analyser). 
 
 I have seen some XML indexing in the Sandbox. Is there anybody here who
 has done
 something with a thin pull parser (perhaps even kXML)? Does anybody
 know of
 a project or some sourcecode available which covers this topic?
 
 Karl
 
  
 
 



Re: Disk space used by optimize

2005-01-27 Thread Otis Gospodnetic
Hello,

Yes, that is how optimize works - copies all existing index segments
into one unified index segment, thus optimizing it.

see hit #1: http://www.lucenebook.com/search?query=optimize+disk+space

However, three times the space sounds a bit too much, or I made a
mistake in the book. :)

You said you end up with 3 files - .cfs is one of them, right?

Otis


--- Kauler, Leto S [EMAIL PROTECTED] wrote:

 
 Just a quick question:  after writing an index and then calling
 optimize(), is it normal for the index to expand to about three times
 the size before finally compressing?
 
 In our case the optimise grinds the disk, expanding the index into
 many
 files of about 145MB total, before compressing down to three files of
 about 47MB total.  That must be a lot of disk activity for the people
 with multi-gigabyte indexes!
 
 Regards,
 Leto
 
 CONFIDENTIALITY NOTICE AND DISCLAIMER
 
 Information in this transmission is intended only for the person(s)
 to whom it is addressed and may contain privileged and/or
 confidential information. If you are not the intended recipient, any
 disclosure, copying or dissemination of the information is
 unauthorised and you should delete/destroy all copies and notify the
 sender. No liability is accepted for any unauthorised use of the
 information contained in this transmission.
 
 This disclaimer has been automatically added.
 



Re: rackmount lucene/nutch - Re: google mini? who needs it when Lucene is there

2005-01-27 Thread Otis Gospodnetic
I discuss this with myself a lot inside my head... :)
Seriously, I agree with Erik.  I think this is a business opportunity.
How many people are hating me now and going "shhh"?  Raise your
hands!

Otis

--- David Spencer [EMAIL PROTECTED] wrote:

 This reminds me, has anyone every discussed something similar:
 
 - rackmount server ( or for coolness factor, that mini mac)
 - web i/f for config/control
 
 - of course the server would have the following s/w:
 -- web server
 -- lucene / nutch
 
 Part of the work here I think is having a decent web i/f to configure
 
 the thing and to customize the L&F of the search results.
 
 
 
 jian chen wrote:
  Hi,
  
  I was searching using google and just found that there was a new
  feature called google mini. Initially I thought it was another
 free
  service for small companies. Then I realized that it costs quite
 some
  money ($4,995) for the hardware and software. (I guess the
 proprietary
  software costs a whole lot more than actual hardware.)
  
  The nice feature is that, you can only index up to 50,000
 documents
  with this price. If you need to index more, sorry, send in the
  check...
  
  It seems to me that any small biz will be ripped off if they
 install
  this google mini thing, compared to using Lucene to implement a
 easy
  to use search software, which could search up to whatever number of
  documents you could imagine.
  
  I hope the lucene project could get exposed more to the enterprise
 so
  that people know that they have not only cheaper but more
 importantly,
  BETTER alternatives.
  
  Jian
  
 



RE: Disk space used by optimize

2005-01-27 Thread Otis Gospodnetic
Have you tried using the multifile index format?  Now I wonder if there
is actually a difference in disk space consumed by optimize() when you
use multifile and compound index format...

Otis

--- Kauler, Leto S [EMAIL PROTECTED] wrote:

 Our copy of LIA is in the mail ;)
 
 Yes the final three files are: the .cfs (46.8MB), deletable (4
 bytes),
 and segments (29 bytes).
 
 --Leto
 
 
 
  -Original Message-
  From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] 
  
  Hello,
  
  Yes, that is how optimize works - copies all existing index 
  segments into one unified index segment, thus optimizing it.
  
  see hit #1:
 http://www.lucenebook.com/search?query=optimize+disk+space
  
  However, three times the space sounds a bit too much, or I 
  make a mistake in the book. :)
  
  You said you end up with 3 files - .cfs is one of them, right?
  
  Otis
  
  
  --- Kauler, Leto S [EMAIL PROTECTED] wrote:
  
   
   Just a quick question:  after writing an index and then calling 
   optimize(), is it normal for the index to expand to about 
  three times 
   the size before finally compressing?
   
   In our case the optimise grinds the disk, expanding the index
 into 
   many files of about 145MB total, before compressing down to three
 
   files of about 47MB total.  That must be a lot of disk activity
 for 
   the people with multi-gigabyte indexes!
   
   Regards,
   Leto
 
 CONFIDENTIALITY NOTICE AND DISCLAIMER
 
 Information in this transmission is intended only for the person(s)
 to whom it is addressed and may contain privileged and/or
 confidential information. If you are not the intended recipient, any
 disclosure, copying or dissemination of the information is
 unauthorised and you should delete/destroy all copies and notify the
 sender. No liability is accepted for any unauthorised use of the
 information contained in this transmission.
 
 This disclaimer has been automatically added.
 



Re: google mini? who needs it when Lucene is there

2005-01-27 Thread Otis Gospodnetic
500 times the original data?  Not true! :)

Otis

--- Xiaohong Yang (Sharon) [EMAIL PROTECTED] wrote:

 Hi,
  
 I agree that Google mini is quite expensive.  It might be similar to
 the desktop version in quality.  Does anyone know google's ratio of index
 to text?   Is it true that Lucene's index is about 500 times the
 original text size (not including image size)?  I don't have one
 installed, so I cannot measure.
  
 Best,
  
 Sharon
 
 jian chen [EMAIL PROTECTED] wrote:
 Hi,
 
 I was searching using google and just found that there was a new
 feature called google mini. Initially I thought it was another free
 service for small companies. Then I realized that it costs quite some
 money ($4,995) for the hardware and software. (I guess the
 proprietary
 software costs a whole lot more than actual hardware.)
 
 The nice feature is that, you can only index up to 50,000 documents
 with this price. If you need to index more, sorry, send in the
 check...
 
 It seems to me that any small biz will be ripped off if they install
 this google mini thing, compared to using Lucene to implement a easy
 to use search software, which could search up to whatever number of
 documents you could imagine.
 
 I hope the lucene project could get exposed more to the enterprise so
 that people know that they have not only cheaper but more
 importantly,
 BETTER alternatives.
 
 Jian
 



Re: Lucene in Action hits desk in UK

2005-01-26 Thread Otis Gospodnetic
The publisher-to-Amazon information feed seems to be a fairly manual
process, and Amazon takes a while to update book information on their
site, including prices.

I contacted both the US and UK Amazon sites and asked them to fix my
last name (the last character in my name has a little slash (not an
accent) above it), but they never bothered to fix it nor email me back.

Anyhow, enjoy Lucene in Action!

Otis Gospodnetić

--- John Haxby [EMAIL PROTECTED] wrote:

 
 My copy of Lucene in Action has finally hit my desk in the UK.  
 Hopefully the dispatch time quoted by amazon.co.uk will now start to 
 drop to something more sensible.
 
 It's been interesting watching the price changes.  When I ordered my 
 copy back in November, I paid £19.38 for it.  At around the time of 
 publication, the price went up to £35.99, the list price.   It's 
 currently priced at £25.19, 30% off list price.
 
 jch
 



Re: Getting Into Search

2005-01-26 Thread Otis Gospodnetic
Hi Luke,

That's not hard with RangeQuery (supported by QueryParser), take a look
at this:
  http://www.lucenebook.com/search?query=date+range

The grayed-out text has the section name and page number, so you can
quickly locate this stuff in your ebook.
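
For example, if ModificationDate is indexed as a Field.Keyword in
yyyyMMdd form (just an illustration, adjust to however you store dates),
QueryParser will handle queries like:

  ModificationDate:[20050101 TO 20050131]    -- inclusive range
  ModificationDate:{20041231 TO 20050201}    -- exclusive range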

Otis
P.S.
Do you know if Indigo/Chapters has Lucene in Action on their book
shelves yet?


--- Luke Shannon [EMAIL PROTECTED] wrote:

 Hello;
 
 My lucene application has been performing well in our company's CMS
 application. The plan now is to offer advanced searching.
 
 I just bought the eBook version of Lucene in Action to help with my
 research (it is taking Amazon forever to ship the printed version to Canada).
 
 The book looks great and will certainly deepen my understanding. But I am
 suffering a bit of information overload.
 
 I was hoping I could post the rough requirements I was given this
 morning and perhaps some more experienced Luceners could help direct my
 research (this can even be pointing me to relevant sections of the book).
 
 1. Documents in the system contain the following fields,
 ModificationDate,
 CreationDate. A query is required that allows users to search for
 documents
 created/modified on a certain date or within a certain date range.
 
 2. Documents in the system also contain fields: Title, Path. A query
 is
 required that allows users to search for Titles or Path starting
 with,
 ending with, containing (this is all the system currently does) or
 matching
 specific term(s).
 
 Later today I will get more specific requirements. For now I am looking
 through the Analysis section of the eBook for ideas on how to handle
 this. Any
 tips anyone can give would be appreciated.
 
 Thanks,
 
 Luke
 
 
 
 



Re: Lucene in Action hits desk in UK

2005-01-26 Thread Otis Gospodnetic
Gospodnetić sounds like Gospodnetich and Eric is Erik :)

Otis

--- John Haxby [EMAIL PROTECTED] wrote:

 Otis Gospodnetic wrote:
 
 I contacted both the US and UK Amazon sites and asked them to fix my
 last name (the last character in my name has a little slash (not an
 accent) above it), but they never bothered to fix it nor email me
 back.
 
   
 
 They probably don't know how to type a ć.  How _do_ you pronounce
 your 
 name?  I've no idea what to do with that mark over the final c!  At
 the 
 moment it's Lucene in Action by Eric Hatcher and Otis 
 Gospo-something-or-other :-)
 
 Anyhow, enjoy Lucene in Action!
   
 
 Already doing so!
 
 jch
 



Re: Search on heterogenous index

2005-01-25 Thread Otis Gospodnetic
Hello Simeon,

Heterogeneous Documents/indices are OK - check out the second hit:

  http://www.lucenebook.com/search?query=heterogenous+different

Otis

--- Simeon Koptelov [EMAIL PROTECTED] wrote:

 Hello all. I'm new to lucene and think about using it in my project.
 
 I have prices with dynamic structure, containing wares there, about
 10K prices 
 with total 500K wares. Each price has about 5 text fields. 
 
 I'll do searches on wares. The difficult part is that I'll do
 searches for all 
 wares, the search is not bound to a particular price structure.
 
 My question is, how should I organize my indices? Can Lucene handle
 data 
 effectlively if I'll have one index containing different Fields in
 Documents? 
 Or should I create a separate index for each price with same Fields
 structure 
 across Documents?
 



Re: Search Chinese in Unicode !!!

2005-01-25 Thread Otis Gospodnetic
I don't have a document with chinese characters to verify this, but it
looks right, so I'll add your change to SearchFiles.java.

Thanks,
Otis

--- Eric Chow [EMAIL PROTECTED] wrote:

 Search not really correct with UTF-8 !!!
 
 
 The following is the search result that I used the SearchFiles in the
 lucene demo.
 
 d:\Downloads\Softwares\Apache\Lucene\lucene-1.4.3\src> java
 org.apache.lucene.demo.SearchFiles c:\temp\myindex
 Usage: java SearchFiles idnex
 Query: 經
 Searching for: g 
 strange ??
 3 total matching documents
 0. ../docs/ChineseDemo.html    <-- this file
 contains the 經
-
 1. ../docs/luceneplan.html
- Jakarta Lucene - Plan for enhancements to Lucene
 2. ../docs/api/index-all.html
- Index (Lucene 1.4.3 API)
 Query: 
 
 
 
 From the above result only the ChineseDemo.html includes the
 character
 that I want to search !
 
 
 
 
 The modified code in SearchFiles.java:
 
 
 BufferedReader in = new BufferedReader(new
 InputStreamReader(System.in, "UTF-8"));
 



Re: English and French documents together / analysis, indexing, searching

2005-01-23 Thread Otis Gospodnetic
That would be a partial solution.  Accents will not be a problem any
more, but if you use an Analyzer that stems tokens, they will not really
be tokenized properly.  Searches will probably work, but if you look at
the index you will see that some terms were not analyzed properly.  But
it may be sufficient for your needs, so try just with accent removal.

Otis


--- [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:

 Morus Walter said the following on 1/21/2005 2:14 AM:
 
  No. You could do a ( ( french-query ) or ( english-query ) )
 construct 
  using
 
 one query. So query construction would be a bit more complex but
 querying
 itself wouldn't change.
 
 The first thing I'd do in your case would be to look at the
 differences
 in the output of english and french snowball stemmer.
 I don't speak any french, but probably you might even use both
 stemmers
 on all texts.
 
 Morus
 
 
 I've done some thinking afterwards, and instead of messing with
 complex 
 queries, would it make sense to
 replace all special characters such as é, è with e during 
 indexing (I suppose write a custom analyzer)
 and then during searching parse the query and replace all occurrences
 of 
 special characters (if any) with their
 normal latin equivalents?
 
 This should produce the required results, no? Since the index would
 not 
 contain any French characters and
 searching for French words would return them since they were indexed
 as 
 normal words.
 
 -pedja
 
 
 





Re: keep indexes as files or save them in database

2005-01-23 Thread Otis Gospodnetic
A number of people have tried putting Lucene indices in RDBMS.  As far
as I know, all were slower than FSDirectory.

Otis

--- nafise hassani [EMAIL PROTECTED] wrote:

 Hi
 I want to know, from the performance point of view, whether it
 is better to save lucene indexes in a database or use
 them as files???
 Any suggestions??
 best regards  
 
 
   
 



Re: Opening up one large index takes 940M or memory?

2005-01-22 Thread Otis Gospodnetic
It would be interesting to know _what_exactly_ uses your memory. 
Running under an optimizer should tell you that.

The only thing that comes to mind is... can't remember the details now,
but when the index is opened, I believe every 128th term is read into
memory.  This, I believe, helps with index seeks at search time.  I
wonder if this is what's using your memory.  The number '128' can't be
modified just like that, but somebody (Julien?) has modified the code
in the past to make this variable.  That's the only thing I can think
of right now and it may or may not be an idea in the right direction.

Otis


--- Kevin A. Burton [EMAIL PROTECTED] wrote:
 We have one large index right now... its about 60G ... When I open it
 
 the Java VM used 940M of memory.  The VM does nothing else besides
 open 
 this index.
 
 Here's the code:
 
 System.out.println( "opening..." );
 
 long before = System.currentTimeMillis();
 Directory dir = FSDirectory.getDirectory( 
 "/var/ksa/index-1078106952160/", false );
 IndexReader ir = IndexReader.open( dir );
 System.out.println( ir.getClass() );
 long after = System.currentTimeMillis();
 System.out.println( "opening...done - duration: " + 
 (after-before) );
 
 System.out.println( "totalMemory: " + 
 Runtime.getRuntime().totalMemory() );
 System.out.println( "freeMemory: " + 
 Runtime.getRuntime().freeMemory() );
 
 Is there any way to reduce this footprint?  The index is fully 
 optimized... I'm willing to take a performance hit if necessary.  Is 
 this documented anywhere?
 
 Kevin
 
 -- 
 
 Use Rojo (RSS/Atom aggregator).  Visit http://rojo.com. Ask me for an
 
 invite!  Also see irc.freenode.net #rojo if you want to chat.
 
 Rojo is Hiring! - http://www.rojonetworks.com/JobsAtRojo.html
 
 If you're interested in RSS, Weblogs, Social Networking, etc... then
 you 
 should work for Rojo!  If you recommend someone and we hire them
 you'll 
 get a free iPod!
 
 Kevin A. Burton, Location - San Francisco, CA
AIM/YIM - sfburtonator,  Web - http://peerfear.org/
 GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 
 



Re: Opening up one large index takes 940M or memory?

2005-01-22 Thread Otis Gospodnetic
There Kevin, that's what I was referring to, the .tii file.

Otis

--- Paul Elschot [EMAIL PROTECTED] wrote:

 On Saturday 22 January 2005 01:39, Kevin A. Burton wrote:
  Kevin A. Burton wrote:
  
   We have one large index right now... its about 60G ... When I
 open it 
   the Java VM used 940M of memory.  The VM does nothing else
 besides 
   open this index.
  
  After thinking about it I guess 1.5% of memory per index really
 isn't 
  THAT bad.  What would be nice if there was a way to do this from
 disk 
  and then use the a buffer (either via the filesystem or in-vm
 memory) to 
  access these variables.
 
 It's even documented. From:
 http://jakarta.apache.org/lucene/docs/fileformats.html :
 
 The term info index, or .tii file. 
 This contains every IndexIntervalth entry from the .tis file, along
 with its
 location in the tis file. This is designed to be read entirely
 into memory
 and used to provide random access to the tis file. 
 
 My guess is that this is what you see happening.
 To see the actual .tii file, you need the non-default file format.
 
 Once searching starts you'll also see that the field norms are
 loaded;
 these take one byte per searched field per document.
 
  This would be similar to the way the MySQL index cache works...
 
 It would be possible to add another level of indexing to the terms.
 No one has done this yet, so I guess it's prefered to buy RAM
 instead...
 
 Regards,
 Paul Elschot
 
 



Re: Lucene in Action

2005-01-22 Thread Otis Gospodnetic
Hi Ansi,

If you want the print version, I would guess you could order it from
the publisher (http://www.manning.com/hatcher2) or from Amazon and they
will ship it to you in China.  The electronic version (a PDF file) is
also available from the above URL.

I'll ask Manning Publications and see whether they ship outside the
U.S.

Otis


--- ansi [EMAIL PROTECTED] wrote:

 hi,all
 
 Does anyone know how to buy Lucene in Action in China?
 
 Ansi
 



Re: Opening up one large index takes 940M or memory?

2005-01-22 Thread Otis Gospodnetic
Yes, I remember your email about the large number of Terms.  If it can
be avoided and you figure out how to do it, I'd love to patch
something. :)

Otis

--- Kevin A. Burton [EMAIL PROTECTED] wrote:

 Otis Gospodnetic wrote:
 
 It would be interesting to know _what_exactly_ uses your memory. 
 Running under an optimizer should tell you that.
 
 The only thing that comes to mind is... can't remember the details
 now,
 but when the index is opened, I believe every 128th term is read
 into
 memory.  This, I believe, helps with index seeks at search time.  I
 wonder if this is what's using your memory.  The number '128' can't
 be
 modified just like that, but somebody (Julien?) has modified the
 code
 in the past to make this variable.  That's the only thing I can
 think
 of right now and it may or may not be an idea in the right
 direction.
   
 
 I loaded it into a profiler a long time ago. Most of the code was due
 to 
 Term classes being loaded into memory.
 
 I might try to get some time to load it into a profiler on monday...
 
 Kevin
 
 -- 
 
 Use Rojo (RSS/Atom aggregator).  Visit http://rojo.com. Ask me for an
 
 invite!  Also see irc.freenode.net #rojo if you want to chat.
 
 Rojo is Hiring! - http://www.rojonetworks.com/JobsAtRojo.html
 
 If you're interested in RSS, Weblogs, Social Networking, etc... then
 you 
 should work for Rojo!  If you recommend someone and we hire them
 you'll 
 get a free iPod!
 
 Kevin A. Burton, Location - San Francisco, CA
AIM/YIM - sfburtonator,  Web - http://peerfear.org/
 GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 
 



Re: Stemming

2005-01-21 Thread Otis Gospodnetic
Hi Kevin,

Stemming is an optional operation and is done in the analysis step. 
Lucene comes with a Porter stemmer and a Filter that you can use in an
Analyzer:

./src/java/org/apache/lucene/analysis/PorterStemFilter.java
./src/java/org/apache/lucene/analysis/PorterStemmer.java

You can find more about it here:
http://www.lucenebook.com/search?query=stemming
You can also see mentions of SnowballAnalyzer in those search results,
and you can find an adapter for SnowballAnalyzers in Lucene Sandbox.
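
A minimal custom analyzer using it might look something like this (a
sketch, not from the book):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseTokenizer;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.TokenStream;

public class StemmingAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // split on letters and lower-case, then apply the Porter stemmer
        return new PorterStemFilter(new LowerCaseTokenizer(reader));
    }
}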

Otis

--- Kevin L. Cobb [EMAIL PROTECTED] wrote:

 I want to understand how Lucene uses stemming but can't find any
 documentation on the Lucene site. I'll continue to google but hope
 that
 this list can help narrow my search. I have several questions on the
 subject currently but hesitate to list them here since finding a good
 document on the subject may answer most of them. 
 
  
 
 Thanks in advance for any pointers,
 
  
 
 Kevin
 
  
 
  
 
 





RE: Filtering w/ Multiple Terms

2005-01-21 Thread Otis Gospodnetic
This:
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/BooleanQuery.TooManyClauses.html
?

You can control that limit via
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/BooleanQuery.html#maxClauseCount

Otis


--- Jerry Jalenak [EMAIL PROTECTED] wrote:

 OK.  But isn't there a limit on the number of BooleanQueries that can
 be
 combined with AND / OR / etc?
 
 
 
 Jerry Jalenak
 Senior Programmer / Analyst, Web Publishing
 LabOne, Inc.
 10101 Renner Blvd.
 Lenexa, KS  66219
 (913) 577-1496
 
 [EMAIL PROTECTED]
 
 
  -Original Message-
  From: Erik Hatcher [mailto:[EMAIL PROTECTED]
  Sent: Thursday, January 20, 2005 5:05 PM
  To: Lucene Users List
  Subject: Re: Filtering w/ Multiple Terms
  
  
  
  On Jan 20, 2005, at 5:02 PM, Jerry Jalenak wrote:
  
   In looking at the examples for filtering of hits, it looks 
  like I can 
   only
   specify a single term; i.e.
  
 Filter f = new QueryFilter(new TermQuery(new Term(acct,
   acct1)));
  
   I need to specify more than one term in my filter.  Short of
 using 
   something
   like ChainFilter, how are others handling this?
  
  You can make as complex of a Query as you want for 
  QueryFilter.  If you 
  want to filter on multiple terms, construct a BooleanQuery 
  with nested 
  TermQuery's, either in an AND or OR fashion.
  
  Erik
  
  
 
 -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail:
 [EMAIL PROTECTED]
  
  
 
 This transmission (and any information attached to it) may be
 confidential and
 is intended solely for the use of the individual or entity to which
 it is
 addressed. If you are not the intended recipient or the person
 responsible for
 delivering the transmission to the intended recipient, be advised
 that you
 have received this transmission in error and that any use,
 dissemination,
 forwarding, printing, or copying of this information is strictly
 prohibited.
 If you have received this transmission in error, please immediately
 notify
 LabOne at the following email address:
 [EMAIL PROTECTED]
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Suggestion needed for extranet search

2005-01-21 Thread Otis Gospodnetic
Hi Ranjan,

It sounds like you should look at and use Nutch:
http://www.nutch.org

Otis

--- Ranjan K. Baisak [EMAIL PROTECTED] wrote:

 I am planning to move to Lucene but not have much
 knowledge on the same. The search engine which I had
 developed is searching some extranet URLs e.g.
 codeguru.com/index.html. Is is possible to get the
 same functionality using Lucene. So basically can I
 make Lucene as a search engine to search extranets.
 
 regards,
 Ranjan
 
 __
 Do You Yahoo!?
 Tired of spam?  Yahoo! Mail has the best spam protection around 
 http://mail.yahoo.com 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Concurrent read and write

2005-01-21 Thread Otis Gospodnetic
Hello Ashley,

You can read/search while modifying the index, but you have to ensure
only one thread or only one process is modifying an index at any given
time.  Both IndexReader and IndexWriter can be used to modify an index.
 The former to delete Documents and the latter to add them.  You have
to ensure these two operations don't overlap.
c.f. http://www.lucenebook.com/search?query=concurrent
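Here is a minimal single-JVM sketch of that rule (the path and class
names are made up); searches can keep running, but every modification
goes through one lock:

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class IndexUpdater {
  private static final Object INDEX_LOCK = new Object();
  private static final String INDEX_DIR = "/path/to/index";

  // adding documents is done with an IndexWriter
  public void addDocument(Document doc) throws IOException {
    synchronized (INDEX_LOCK) {
      IndexWriter writer =
          new IndexWriter(INDEX_DIR, new StandardAnalyzer(), false);
      try {
        writer.addDocument(doc);
      } finally {
        writer.close();
      }
    }
  }

  // deleting documents is done with an IndexReader
  public void deleteDocuments(Term term) throws IOException {
    synchronized (INDEX_LOCK) {
      IndexReader reader = IndexReader.open(INDEX_DIR);
      try {
        reader.delete(term);  // removes every document containing this term
      } finally {
        reader.close();
      }
    }
  }
}

Since your indexer runs as a separate cron job, a JVM-level lock won't
help across processes -- there you have to make sure only one updating
process runs at a time and let Lucene's own lock files guard the rest.
Searching from the web form while the cron job updates the index is
fine.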

Otis


--- Ashley Steigerwalt [EMAIL PROTECTED] wrote:

 I am a little fuzzy on the thread-safeness of Lucene, or maybe just
 java.  
 From what I understand, and correct me if I'm wrong, Lucene takes
 care of 
 concurrency issues and it is ok to run a query while writing to an
 index.
 
 My question is, does this still hold true if the reader and writer
 are being 
 executed as separate programs?  I have a cron job that will update
 the index 
 periodically.  I also have a search application on a web form.  Is
 this going 
 to cause trouble if someone runs a query while the indexer is
 updating?
 
 Ashley
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Search Chinese in Unicode !!!

2005-01-21 Thread Otis Gospodnetic
If you are hosting the code somewhere (e.g. your site, SF, java.net,
etc.), we should link to them from one of the Lucene pages where we
link to related external tools, apps, and such.

Otis


--- Safarnejad, Ali (AFIS) [EMAIL PROTECTED] wrote:

 I've written a Chinese Analyzer for Lucene that uses a segmenter
 written by
 Erik Peterson. However, as the author of the segmenter does not want
 his code
 released under apache open source license (although his code _is_
 opensource), I cannot place my work in the Lucene Sandbox.  This is
 unfortunate, because I believe the analyzer works quite well in
 indexing and
 searching chinese docs in GB2312 and UTF-8 encoding, and I like more
 people
 to test, use, and confirm this.  So anyone who wants it, can have it.
 Just
 shoot me an email.
 BTW, I also have written an arabic analyzer, which is collecting dust
 for
 similar reasons.
 Good luck,
 
 Ali Safarnejad
 
 
 -Original Message-
 From: Eric Chow [mailto:[EMAIL PROTECTED] 
 Sent: 21 January 2005 11:42
 To: Lucene Users List
 Subject: Re: Search Chinese in Unicode !!!
 
 
 Search not really correct with UTF-8 !!!
 
 
 The following is the search result that I used the SearchFiles in the
 lucene
 demo.
 
 d:\Downloads\Softwares\Apache\Lucene\lucene-1.4.3\src>java
 org.apache.lucene.demo.SearchFiles c:\temp\myindex
 Usage: java SearchFiles idnex
 Query: å´
 Searching for: g 
 strange ??
 3 total matching documents
 0. ../docs/ChineseDemo.htmlthis files
 contains
 the å´
-
 1. ../docs/luceneplan.html
- Jakarta Lucene - Plan for enhancements to Lucene
 2. ../docs/api/index-all.html
- Index (Lucene 1.4.3 API)
 Query: 
 
 
 
 From the above result only the ChineseDemo.html includes the
 character that I
 want to search !
 
 
 
 
 The modified code in SearchFiles.java:
 
 
 BufferedReader in = new BufferedReader(new
 InputStreamReader(System.in,
 UTF-8));
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: help in indexing

2005-01-20 Thread Otis Gospodnetic
Hello Chetan,

The code that comes with the Lucene book contains a little framework
for indexing rich-text documents.  It sounds like you may be able to
use it as-is, and extending it with a parser for Excel files, which we
didn't include in the code (should we include it in the next edition?).
 While PDFBox comes with that handy Lucene-specific class that you are
using, it may be better for you to be in control of how exactly you
construct your Lucene documents.
c.f. http://www.lucenebook.com/search?query=framework

Otis

--- chetan minajagi [EMAIL PROTECTED] wrote:

 Hi Karthik/Cocula,
 
 Luke didn't work but Limo helped.I seem to get results when i use
 Limo for my text/xls files.
 Now the problem with pdf search
 The problem that i see is the summary field as seen through LIMO is
 not indexed and hence no hits.
 I'm using the default document got by 
  LucenePDFDocument.getDocument(myPdfFile);
 So how do i ensure that a few of the fields in this which are not
 indexed are set to indexed.
 As far as I can see I can only probe whether a field is indexed or
 not by using 
 Field.isIndexed() but is there a method by which i can set to
 indexed.
 can someone provide any help or pointers in this regard?
  
 Thanks  Regards,
 Chetan
 
 Karthik N S [EMAIL PROTECTED] wrote:
 Hi
 
 Probably u need to use the Luke S/w to peek insid tu'r Indexer,Use it
 then
 come back for more help
 
 
 Karthik
 
 
 -Original Message-
 From: chetan minajagi [mailto:[EMAIL PROTECTED]
 Sent: Thursday, January 20, 2005 12:05 PM
 To: lucene-user@jakarta.apache.org
 Subject: help in indexing
 
 
 Hi ,
 
 It might seem elementary to most of you.
 I am trying to build a search tool for internal use using lucene.
 I have used the following
 for
 .pdf -- PDFBOx
 .html -- demo file of lucene(HTMLDocument)
 .xls -- poi
 
 The indexing seems to work without throwing up any errors.
 But,when i try to search i end up getting with zero hits always.
 I have tried to use the same string that i see
 (System.out.print(Document))
 but in vain.
 Can somebody let me know where and what could be wrong.
 Regards,
 Chetan
 
 
 -
 Do you Yahoo!?
 Yahoo! Search presents - Jib Jab's 'Second Term'
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 
   
 -
 Do you Yahoo!?
  Yahoo! Mail - You care about security. So do we.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: lucene2.0 and transaction support

2005-01-20 Thread Otis Gospodnetic
The Wiki has some info about Lucene 2.0, but that is all there is about
2.0.

Regarding transactions - have you tried DbDirectory?  I believe that
will provide XA support and it won't require Lucene changes.

Otis


--- John Wang [EMAIL PROTECTED] wrote:

 Hi:
 
When is lucene 2.0 scheduled to be released? Is there a javadoc
 somewhere so we can check out the new APIs?
 
 Is there a plan to add transaction support into lucene? This is
 something we need and if we do implement it ourselves, is it too
 large
 of a change for a patch?
 
 Thanks
 
 -John
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Closed IndexWriter reuse

2005-01-20 Thread Otis Gospodnetic
No, you can't add documents to an index once you close the IndexWriter.
You can re-open the IndexWriter and add more documents, of course.
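A quick sketch ('true' creates a new index, 'false' opens an existing
one):

IndexWriter writer = new IndexWriter("/path/to/index", analyzer, true);
writer.addDocument(doc1);
writer.close();   // flushes the changes and releases the write lock

// later: open a new IndexWriter on the same directory to add more
writer = new IndexWriter("/path/to/index", analyzer, false);
writer.addDocument(doc2);
writer.close();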

Otis

--- Oscar Picasso [EMAIL PROTECTED] wrote:

 Hi,
 
 Is it safe to add documents to an IndexWriter that has been closed? 
 
 From what I have seen, the close method flush the changes, closes the
 files but
 it creates new files allowing to add new documents.
 
 Am I right?
 
 Thanks.
 
 
   
 __ 
 Do you Yahoo!? 
 Yahoo! Mail - Easier than ever with enhanced search. Learn more.
 http://info.mail.yahoo.com/mail_250
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Why IndexReader.lastModified(index) is depricated?

2005-01-19 Thread Otis Gospodnetic
Going for the segments file like that is not a recommended practise, or
at least not something I'd recommend.  'segments' file is really
something that a caller should not know anything about.  One day
Lucene may choose to rename the segments file or some such, and the
code that uses this trick will break.

To answer the original question, yes, I think it would be handy to have
this method back.  Perhaps we should revive it/them, ha?

Otis


--- Chris Hostetter [EMAIL PROTECTED] wrote:

 
 : Why IndexReader.lastModified(index) is depricated?
 
 Did you read the javadocs?
 
Synchronization of IndexReader and IndexWriter instances is no
 longer
done via time stamps of the segments file since the time
 resolution
depends on the hardware platform. Instead, a version number is
maintained within the segments file, which is incremented
 everytime
when the index is changed.
 
 : It's always a good idea to know when the index changed last time,
 for
 
 That's a good point, and you can still get that information using the
 same
 underlying method IndexReader.lastModified did/does...
 
  directory.fileModified(segments);
 
 ...it's just no longer crucial that IndexReader have that
 information.
 
 
 
 -Hoss
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: How do I unlock?

2005-01-11 Thread Otis Gospodnetic
I didn't pay full attention to this thread, but it sounds like somebody
may be interested in RuntimeShutdownHook (or some similar name) as a
place to try to release the locks.

Otis

--- Joseph Ottinger [EMAIL PROTECTED] wrote:

 On Tue, 11 Jan 2005, Doug Cutting wrote:
 
  Joseph Ottinger wrote:
   As one for whom the question's come up recently, I'd say that
 locks need
   to be terminated gracefully, instead. I've noticed a number of
 cases where
   the locks get abandoned in exceptional conditions, which is
 almost exactly
   what you don't want.
 
  The problem is that this is hard to do from Java.  A typical
 approach is
  to put the process id in the lock file, then, if that process is
 dead,
  ignore the lock file.  But Java does not let one know process ids. 
 Java
  1.4 provides a LockFile mechanism which should mostly solve this,
 but
  Lucene 1.4.3 does not yet require Java 1.4 and hence cannot use
 that
  feature.  Lucene 2.0 is likely to require Java 1.4 and should be
 able to
  do a better job of automatically unlocking indexes when processes
 die.
 
 Agreed - but while there are some situations in which releasing locks
 is
 difficult (i.e., JVM catastrophic shutdown), there are others in
 which
 attempts could be made via finally blocks, etc.
 

---
 Joseph B. Ottinger
 http://enigmastation.com
 IT Consultant   
 [EMAIL PROTECTED]
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: How do I unlock?

2005-01-11 Thread Otis Gospodnetic
Eh, exactly that :)  That's what happens when I read my emails in reverse order...

--- Chris Lamprecht [EMAIL PROTECTED] wrote:

 What about a shutdown hook?
   
 Runtime.getRuntime().addShutdownHook(new Thread() {
 public void run() { /* whatever */ }
 });
 
 see also
 http://www.onjava.com/pub/a/onjava/2003/03/26/shutdownhook.html
 
 
 On Tue, 11 Jan 2005 13:21:42 -0800, Doug Cutting [EMAIL PROTECTED]
 wrote:
  Joseph Ottinger wrote:
   As one for whom the question's come up recently, I'd say that
 locks need
   to be terminated gracefully, instead. I've noticed a number of
 cases where
   the locks get abandoned in exceptional conditions, which is
 almost exactly
   what you don't want.
  
  The problem is that this is hard to do from Java.  A typical
 approach is
  to put the process id in the lock file, then, if that process is
 dead,
  ignore the lock file.  But Java does not let one know process ids. 
 Java
  1.4 provides a LockFile mechanism which should mostly solve this,
 but
  Lucene 1.4.3 does not yet require Java 1.4 and hence cannot use
 that
  feature.  Lucene 2.0 is likely to require Java 1.4 and should be
 able to
  do a better job of automatically unlocking indexes when processes
 die.
  
  Doug
  
 
 -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail:
 [EMAIL PROTECTED]
  
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Performance question

2005-01-10 Thread Otis Gospodnetic
Use one index; working with a single index is simpler.  Also, once you
pull a Document from the Hits object, all Fields are read off of the disk.

There was some discussion about selective Field reading about a week
ago, check the list archives.  Also keep in mind Field compression is
now possible (only with unreleased version in CVS).

Otis

--- Crump, Michael [EMAIL PROTECTED] wrote:

 Hello,
 
  
 
 If I have large text fields that are rarely retrieved but need to be
 searched often - Is it better to create 2 indices, one for searching
 and
 one for retrieval, or just one index and put everything in it?
 
  
 
 Or are there other recommendations?
 
  
 
 Regards,
 
  
 
 Michael
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Duplicate Id

2005-01-07 Thread Otis Gospodnetic
Hello,

If you search for India OR Test, you will find both; if you use AND,
you will find none.  Lucene can search any text, not just files.  It
sounds like you are using Lucene's demo as a real application (not a
good practise).  I suggest you take a look at the Resources page on the
Lucene Wiki to get a better idea about what Lucene is and how it can be
used.

Otis


--- mahaveer jain [EMAIL PROTECTED] wrote:

 Hi,
  
 I have a application where I know I will have duplicate ID's. When I
 search these duplicate ID's will it search content in both the files
 ?
  
 For Example :
  
 Id = Mahaveer, Content = Jain India
 Id = Mahaveer, Content = Lucene Test
  
 Now when I search for India Test will it return both the columns ?
 Also can I display unique results ?
  
 Mahaveer
  
  
 
 __
 Do You Yahoo!?
 Tired of spam?  Yahoo! Mail has the best spam protection around 
 http://mail.yahoo.com 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: reading fields selectively

2005-01-06 Thread Otis Gospodnetic
Hi John,

There is no API for this, but I recall somebody talking about adding
support for this a few months back.  I even think that somebody might
have contributed a patch for this.  I am not certain about this, but
check the patch queue (link on Lucene site).  If there is a patch
there, even if the patch no longer applies cleanly, you'll be able to
borrow the code for your own patch.  Also note that the CVS version has
support for field compression, which should help with performance if
you are working with large fields.

Otis

--- John Wang [EMAIL PROTECTED] wrote:

 Hi:
 
Is there some way to read only 1 field value from an index given a
 docID?
 
From the current API, in order to get a field from given a docID,
 I
 would call:
  
 IndexSearcher.document(docID)
 
  which in turn reads in all fields from the disk.
 
Here is my problem:
 
After the search, I have a set of docIDs. For each
 document, I have a unique string identifier. At this point I only
 need
 these identifiers but with the above API, I am forced to read the
 entire row of fields for each document in the search result, which in
 my case can be very large.
 
Is there an alternative?
 
 I am thinking more on the lines of a call:
 
Field[] getFields(int docID,String fieldName);
 
 Thanks
 
 -John
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene Book in UK

2005-01-06 Thread Otis Gospodnetic
The book is $44.95 USD - it's printed on the back cover.  Amazon had
the correct price (minus their discount) until recently.  They are just
very slow with their site/book info updates, but I'm sure they'll fix
it eventually.

Otis


--- Erik Hatcher [EMAIL PROTECTED] wrote:

 
 On Jan 6, 2005, at 3:49 PM, Chris Hostetter wrote:
  BN agrees that the list price is $60.95 ... which may be what
 Manning 
  is
  citing to resellers.
 
 This is incorrect information that has somehow gotten out.  Amazon
 and 
 BN are slow to update their information, but Manning assures me that
 
 they have provided the correct information to Amazon to update.  The 
 actual price you're paying is certainly not indicative of a $60.95
 list 
 price - Amazon doesn't discount 50%, I'm sure.
 
   Erik
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: RemoteSearcher

2005-01-06 Thread Otis Gospodnetic
Nutch (nutch.org) has a pretty sophisticated infrastructure for
distributed searching, but it doesn't use RemoteSearcher.

Otis

--- Yura Smolsky [EMAIL PROTECTED] wrote:

 Hello.
 
 Does anyone know application which based on RemoteSearcher to
 distribute index on many servers?
 
 Yura Smolsky,
 
 
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Parsing issue

2005-01-04 Thread Otis Gospodnetic
That's the correct place to look and it includes code samples.
Yes, it's a Jar file that you add to the CLASSPATH and use ... hm,
normally programmatically, yes :).

Otis

--- Hetan Shah [EMAIL PROTECTED] wrote:

 Has any one used NekoHTML ? If so how do I use it. Is it a stand
 alone 
 jar file that I include in my classpath and start using just like 
 IndexHTML ?
 Can some one share syntax and or code if it is supposed to be used 
 programetically. I am looking at 
 http://www.apache.org/~andyc/neko/doc/html/ for more information is
 that 
 the correct place to look?
 
 Thanks,
 -H
 
 
 Erik Hatcher wrote:
 
  Sure... clean up your HTML and it'll parse fine :)   Perhaps use
 JTidy 
  to clean up the HTML.  Or switch to using a more forgiving parser
 like 
  NekoHTML.
 
  Erik
 
  On Jan 4, 2005, at 3:59 PM, Hetan Shah wrote:
 
  Hello All,
 
  Does any one know how to handle the following parsing error?
 
  thanks for pointers/code snippets.
 
  -H
 
  While trying to parse a HTML file using IndexHTML I get
 
  Parse Aborted: Encountered \ at line 8, column 1162.
  Was expecting one of:
  ArgName ...
  = ...
  TagEnd ...
 
 
 
 
 -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail:
 [EMAIL PROTECTED]
 
 
 
 
 -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail:
 [EMAIL PROTECTED]
 
 
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Help for sorting

2005-01-03 Thread Otis Gospodnetic
Hello,

--- mahaveer jain [EMAIL PROTECTED] wrote:

 I am looking out to implement sorting in my lucene application. This
 is what my code look like. 
 
 I am using StandardAnalyzer() analyzer. 
 
 Query query = QueryParser.parse(keyword, contents, analyzer); 
 
 Sort sortCol = new Sort(new SortField(date));
 
 // date is one of the field I have indexed.
 
 Hits hits = searcher.search(query, sortCol);
 
 for (int start = 0; start < hits.length(); start ++) { 
 Document doc = hits.doc(start); 
 
 // get all the data required.
 } 
 
 I get this error : 
 
 no terms in field sdate - cannot determine sort type 

Is it possible that your 'date' field is empty in some documents you
indexed?  If so, you should specify your sort field type explicitly. 
Look at the Javadoc for SortField class.
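For example, to force a string sort even when some documents have an
empty 'date' field (just a sketch):

// spell out the type instead of letting Lucene guess it from the first term
Sort sortCol = new Sort(new SortField("date", SortField.STRING));
Hits hits = searcher.search(query, sortCol);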

 Can any let me know where I am wrong ? Also what is the default
 sorting in lucene ? 

Default sorting is by rank/score.

 Also can some one explain what exactly is the score ? Is it something
 to do with ranking ? Do somebody have a link to a good lucene
 tutorial ? 

There are links to a few Lucene articles on Lucene's Wiki.  There is
also a link to the Lucene book (Lucene in Action) on the same page. 
Another good source of information about how to use the Lucene API are
Lucene's unit tests.

Otis


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: how often to optimize?

2004-12-28 Thread Otis Gospodnetic
Correct.
The self-maintenance you are referring to is Lucene's periodic segment
merging.  The frequency of that can be controlled through IndexWriter's
mergeFactor.
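In 1.4 these are public fields on IndexWriter, e.g. (the values here are
only for illustration):

IndexWriter writer = new IndexWriter(indexDir, analyzer, false);
writer.mergeFactor = 20;     // merge segments less often (default is 10)
writer.minMergeDocs = 100;   // buffer more docs in RAM before writing a segment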

Otis

--- aurora [EMAIL PROTECTED] wrote:

  Are not optimized indices causing you any problems (e.g. slow
 searches,
  high number of open file handles)?  If no, then you don't even need
 to
  optimize until those issues become... issues.
 
 
 OK I have changed the process to not doing optimize() at all. So far
 so  
 good. The number of files hover from 10 to 40 during the indexing of 
 
 10,000 files. Seems Lucene is doing some kind of self maintenance to
 keep  
 things in order.
 
 Is it right to say optimize() is a totally optional operation? I
 probably  
 get the impression it is a natural step to end an incremental update
 from  
 the IndexHTML example. Since it replicates the whole index it might
 be an  
 overkill for many applications to do daily.
 
 
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Need an analyzer that includes numbers.

2004-12-25 Thread Otis Gospodnetic
WhitespaceAnalyzer will let you have it.  It just breaks the input on
spaces.
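For example, something like this (a sketch; the LowerCaseFilter is only
there if you also want case-insensitive matching):

import java.io.Reader;
import org.apache.lucene.analysis.*;

public class AlphanumericAnalyzer extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    // keeps tokens like "ABC-123" or "4.5mm" intact, splitting only on whitespace
    return new LowerCaseFilter(new WhitespaceTokenizer(reader));
  }
}

Use the same analyzer for indexing and for query parsing.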

Otis

--- Jim [EMAIL PROTECTED] wrote:

 I've seen some discussion on this and the answer seems to be write
 your 
 own.  Hasn't someone already done that by now that would share?  I 
 really have to be able to include numeric and alphanumeric strings in
 my 
 searches.   I don't understand analyzers well enough to roll my own.
 
 Thanks,
 Jim.
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Unable to read TLD META-INF/c.tld from JAR file ... standard.jar

2004-12-23 Thread Otis Gospodnetic
Most definitely Jetty.  I can't believe you're using Tomcat for Rojo!
;)

Otis

--- Erik Hatcher [EMAIL PROTECTED] wrote:

 Wrong list.
 
 Though perhaps you should be using Jetty ;)
 
   Erik
 
 
 On Dec 23, 2004, at 4:17 PM, Kevin A. Burton wrote:
 
  What in the world is up with this exception?
 
  We've migrated to using pre-compiled JSPs in Tomcat 5.5 for  
  performance reasons but if I try to start with a FRESH webapp or
 try  
  to update any of the JSPs and in-place and recompile I'll get this 
 
  error:
 
  Any idea?
 
  I thought maybe the .jar files were corrupt but if I md5sum them
 they  
  are identical to production and the Tomcat standard dist.
 
  Thoughts?
 
  org.apache.jasper.JasperException: /subscriptions/index.jsp(1,1)  
  /init.jsp(2,0) Unable to read TLD META-INF/c.tld from JAR file  
 
 file:/usr/local/jakarta-tomcat-5.5.4/webapps/rojo/ROOT/WEB-INF/lib/ 
  standard.jar: org.apache.jasper.JasperException: Failed to load or
  
  instantiate TagLibraryValidator class:  
  org.apache.taglibs.standard.tlv.JstlCoreTLV
   
 

org.apache.jasper.compiler.DefaultErrorHandler.jspError(DefaultErrorHan
 
  dler.java:39)
   
 

org.apache.jasper.compiler.ErrorDispatcher.dispatch(ErrorDispatcher.jav
 
  a:405)
   
 

org.apache.jasper.compiler.ErrorDispatcher.jspError(ErrorDispatcher.jav
 
  a:86)
   
 

org.apache.jasper.compiler.Parser.processIncludeDirective(Parser.java:
 
  339)
  
 org.apache.jasper.compiler.Parser.parseIncludeDirective(Parser.java: 
  372)
  org.apache.jasper.compiler.Parser.parseDirective(Parser.java:475)
  org.apache.jasper.compiler.Parser.parseElements(Parser.java:1539)
  org.apache.jasper.compiler.Parser.parse(Parser.java:126)
   
 

org.apache.jasper.compiler.ParserController.doParse(ParserController.ja
 
  va:211)
   
 

org.apache.jasper.compiler.ParserController.parse(ParserController.java
 
  :100)
  
 org.apache.jasper.compiler.Compiler.generateJava(Compiler.java:146)
  org.apache.jasper.compiler.Compiler.compile(Compiler.java:286)
  org.apache.jasper.compiler.Compiler.compile(Compiler.java:267)
  org.apache.jasper.compiler.Compiler.compile(Compiler.java:255)
   
 

org.apache.jasper.JspCompilationContext.compile(JspCompilationContext.j
 
  ava:556)
   
 

org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.j
 
  ava:296)
  
 org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java: 
  295)
  org.apache.jasper.servlet.JspServlet.service(JspServlet.java:245)
  javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
 
 
  -- 
 
  Use Rojo (RSS/Atom aggregator).  Visit http://rojo.com. Ask me for
 an  
  invite!  Also see irc.freenode.net #rojo if you want to chat.
 
  Rojo is Hiring! - http://www.rojonetworks.com/JobsAtRojo.html
 
  If you're interested in RSS, Weblogs, Social Networking, etc...
 then  
  you should work for Rojo!  If you recommend someone and we hire
 them  
  you'll get a free iPod!
 Kevin A. Burton, Location - San Francisco, CA
AIM/YIM - sfburtonator,  Web - http://peerfear.org/
  GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 
 
 
 -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail:
 [EMAIL PROTECTED]
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: retrieve tokens

2004-12-22 Thread Otis Gospodnetic
Martijn, have you seen the Highlighter in the Lucene Sandbox?
If you've stored your text in the Lucene index, there is no need to go
back to the DB to pull out the blob, parse it, and highlight it - the
Highlighter in the Sandbox will do this for you.

Otis

--- M. Smit [EMAIL PROTECTED] wrote:

 Hello list,
 
 I'm not sure if this subject will cover my question, but here goes:
 
 consider the following snippet:
 
 is = new IndexSearcher((String)
 envContext.lookup(search_index_dir));
 StopAnalyzer analyzer = new 
 StopAnalyzer(ArticleIndexer.SEARCH_STOP_WORDS_NL);
 
 parser = new 
 NewMultiFieldQueryParser(ArticleIndexer.FIELDS_SEARCH_BASIC,
 analyzer);
 parser.setOperator(QueryParser.DEFAULT_OPERATOR_AND);
 query = parser.parse(searchForm.getCriteria());
 
 hits = is.search(query);
 log.info([execute] aantal Lucene hits:  + hits.length());
 
 Perfect.. And when I present the results, I retrieve the original 
 document from the database through it guid which I get from the 
 doc.get(ArticleIndexer.FIELD_GUID). And besides some businesslogic I 
 have to take care of when I retrieve the original document, I would
 also 
 like to give a context snippet.
 
 So I've written a class which takes care of this context 'snippeting
 and 
 highlighting' (perhaps somebody knows about a great project which I 
 haven't found last week while hunting for it). But I need to have the
 
 original query.. And preferable the words assiociated with the fields
 in 
 (String[]) ArticleIndexer.FIELDS_SEARCH_BASIC. Because every field 
 correspond with a different text-blob in  my DB, so I have to know
 which 
 BufferedReader I have to parse for the associated words..
 
 Thank you for your time,
 Martijn
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: (Offtopic) The unicode name for a character

2004-12-22 Thread Otis Gospodnetic
If you are not tied to Java, see 'unac' at http://www.senga.org/.
It's old, but if nothing else you could see how it works and rewrite it
in Java.  And if you do, you could donate it to the Lucene Sandbox.

Otis

--- Peter Pimley [EMAIL PROTECTED] wrote:

 
 Hi everyone,
 
 The Question:
 In Java generally, Is there an easy way to get the unicode name of a 
 character?  (e.g. LATIN SMALL LETTER A from 'a')
 
 
 The Reasoning (for those who are interested):
 The documents I'm indexing have quite a lot of characters that are 
 basically variations on the basic A-Z ones.  In my analysis step, I'd
 
 like to convert these to their closest equivalent in the basic A-Z
 set.
 
 For some letters, this is easy.  An example is the e-acute character 
 (00E9 LATIN SMALL LETTER E WITH ACUTE).  I'd like to turn that into 
 plain 'e'.  I can do that by using the IBM ICU4J tools to decompose
 the 
 single character into two; 'e' and 0301 COMBINING ACUTE ACCENT.  Then
 I 
 can strip all characters that fail Character.isLetterOrDigit.  That 
 works fine.
 
 Some characters however do not decompose.  An example is the
 character 
 01A4 LATIN CAPITAL LETTER P WITH HOOK.  I'd like to replace that with
 
 'P', but it does not decompose into P + something.
 
 I'm considering taking the unicode name for each character I
 encounter 
 and regexping it against something like:
 ^LATIN .* LETTER (.) WITH .*$
 ... to try and extract the single A-Z|a-z character.
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: retrieve tokens

2004-12-22 Thread Otis Gospodnetic
I suspect Martijn really wants that snippet dynamically generated, with
KWIC, as on the lucenebook.com screen shot.  Thus, he can't generate
and store the snippet at index time, and has to construct it at search
time.
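With the Sandbox highlighter that is roughly (a sketch -- the names are
made up and the exact API may differ a bit between versions):

Query query = QueryParser.parse(userQuery, "contents", analyzer);
Highlighter highlighter = new Highlighter(new QueryScorer(query));
// 'text' can come from a stored Lucene field or from the database blob
String snippet = highlighter.getBestFragment(analyzer, "contents", text);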

Otis

--- Mike Snare [EMAIL PROTECTED] wrote:

  But for the other issue on 'store lucene' vs 'store db'. Does
 anyone can
  provide me with some field experience on size?
  The system I'm developing will provide searching through some 2000
  pdf's, say some 200 pages each. I feed the plain text into Lucene
 on a
  Field.UnStored bases. I also store this plain text in the database
 for
  the sole purpose of presenting a context snippet.
 
 Why not store the snippet in another field that is stored, but not
 indexed?  You could then immediately retrieve the snippet from the
 doc.  This would only increase your index by num_docs * size_snippet
 and would save the db access time and complexity.
 
 -Mike
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: retrieve tokens

2004-12-22 Thread Otis Gospodnetic
For simpy.com I store the full text of web pages in Lucene, in order to
provide full-text web searches.  Nutch (nutch.org) does the same.  You
can set the maximal number of tokens you want indexed via IndexWriter. 
You can also compress fields in the newest version of Lucene (or maybe
just the one in CVS), which may help you if you are concerned about
disk space, although I wouldn't want to have to uncompress each hit's
200 pages worth of text in order to create a summary with KWIC. :)
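For the token limit, a sketch (in 1.4 maxFieldLength is a public field
on IndexWriter, with a default of 10,000 tokens per field):

IndexWriter writer = new IndexWriter(indexDir, analyzer, false);
writer.maxFieldLength = 100000;   // index up to 100,000 tokens per field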

Oh, and you asked about highlighter and field/query matching.  I
_think_ it won't help you with that, but I'm a bit behind on the
highlighter, so you should check the version in CVS and see if it's
capable of this.

Otis


--- M. Smit [EMAIL PROTECTED] wrote:

 Erik Hatcher wrote:
 
 
  Highlighter does not mandate you store your text in the index.  It
 is 
  just a convenient way to do it.  You're free to pull the text from 
  anywhere and highlight it based on the query.
 
  Furthermore, you are saying that the highlighter takes care of the
 
  corresponding field/words for me and pull up a context snippet?
 Ouch, 
  why haven't I stumpled upon the sandbox
 
 
  See a screenshot of it here:  http://www.lucenebook.com (going live
 
  within a week!)
 
 Oh bliss, Oh joy.. This is exactly what I'm looking for... I'll
 plunge 
 in to it and let you know!
 
 But for the other issue on 'store lucene' vs 'store db'. Does anyone
 can 
 provide me with some field experience on size?
 The system I'm developing will provide searching through some 2000 
 pdf's, say some 200 pages each. I feed the plain text into Lucene on
 a 
 Field.UnStored bases. I also store this plain text in the database
 for 
 the sole purpose of presenting a context snippet.
 
 If I were to use the Highlighter with a Field.Text, I will not use
 the 
 database plain part altogether. But still I'm a little worried about 
 speed/space issues. Or am I just seeing bears-on-the-road (Dutch
 saying, 
 in plain English: making a fuzz about nothing)..
 
 Martijn
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: addIndexes() Question

2004-12-22 Thread Otis Gospodnetic
I _think_ you'd be better off doing it all at once, but I wouldn't
trust myself on this and would instead construct a small 3-index set
and test, looking at a) maximal disk usage, b) time, and c) RAM usage.
:)
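For reference, merging all the slices in one call looks roughly like
this (the paths are made up):

IndexWriter writer = new IndexWriter("/path/to/final-index", analyzer, true);
Directory[] slices = new Directory[] {
    FSDirectory.getDirectory("/path/to/slice0", false),
    FSDirectory.getDirectory("/path/to/slice1", false),
    FSDirectory.getDirectory("/path/to/slice2", false)
};
writer.addIndexes(slices);   // merges everything and optimizes the result
writer.close();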

Otis

--- Ryan Aslett [EMAIL PROTECTED] wrote:

  
 Hi there, Im about to embark on a Lucene project of massive scale
 (between 500 million and 2 billion documents).  I am currently
 working
 on parallellizing the construction of the Index(es). 
 
 Rough summary of my plan:
 I have many, many physical machines, each with multiple processors
 that
 I wish to dedicate to the construction of a single index. 
 I plan on having each machine gather its documents from a central
 sychronized source (network, JMS, whatever). 
 Within each machine I will have multiple threads each responsible for
 construcing an index slice.
 
 When all machines and all threads are finished, I should have a slew
 of
 index slices that I want to combine together to create one index.
 
 My question is this:  Will it be more efficient to call
 addIndexes(Directory[] dirs) on all the slices all at once? 
 
 Or might it be better to continually merge small indexes into a
 larger
 index, i.e. once an index slice reaches a particular size, merge it
 into
 the main index and start building a new slice...
 
 Any help would be appreciated.. 
 
 Ryan Aslett
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: index size doubled?

2004-12-21 Thread Otis Gospodnetic
Another possibility is that you are using an older version of Lucene,
which was known to have a bug with similar symptoms.  Get the latest
version of Lucene.

You shouldn't really have multiple .cfs files after optimizing your
index.  Also, optimize only at the end, if you care about indexing
speed.

Otis

--- Paul Elschot [EMAIL PROTECTED] wrote:

 On Tuesday 21 December 2004 05:49, aurora wrote:
  I'm testing the rebuilding of the index. I add several hundred
 documents,  
  optimize and add another few hundred and so on. Right now I have
 around  
  7000 files. I observed after the index gets to certain size.
 Everytime  
  after optimize, the are two files roughly the same size like below:
  
  12/20/2004  01:57p  13 deletable
  12/20/2004  01:57p  29 segments
  12/20/2004  01:53p  14,460,367 _5qf.cfs
  12/20/2004  01:57p  15,069,013 _5zr.cfs
  
  The index total index is double of what I expect. This is not
 always  
  reproducible. (I'm constantly tuning my program and the set of
 document).  
  Sometime I get a decent single document after optimize. What was
 happening?
 
 Lucene tried to delete the older version (_5cf.cfs above), but got an
 error
 back from the file system. After that it has put the name of that
 segment in
 the deletable file, so it can try later to delete that segment.
 
 This is known behaviour on FAT file systems. These randomly take some
 time
 for themselves to finish closing a file after it has been correctly
 closed by
 a program.
 
 Regards,
 Paul Elschot
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: how often to optimize?

2004-12-21 Thread Otis Gospodnetic
Hello,

I think some of these questions may be answered in the jGuru FAQ.

 So my question is would it be an overkill to optimize everyday?

Only if lots of documents are being added/deleted, and you end up with
a lot of index segments.

 Is
 there  
 any guideline on how often to optimize? Every 1000 documents or more?

Are not optimized indices causing you any problems (e.g. slow searches,
high number of open file handles)?  If no, then you don't even need to
optimize until those issues become... issues.

 Every week? Is there any concern if there are a lot of documents
 added without optimizing?

Possibly, see my answer above.

Otis


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: analyzer effecting phrases?

2004-12-20 Thread Otis Gospodnetic
When searching for phrases, what's important is the position of each
token/word extracted by the Analyzer. 
WhitespaceAnalyzer/LowerCaseFilter don't do anything with the
positional information.  Is there anything else in your Analyzer?

In any case, the following should help you see what your Analyzer is
doing:
http://wiki.apache.org/jakarta-lucene/AnalysisParalysis and you can
augment the code there to provide positional information, too.
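Something along these lines (a small sketch, not the wiki code itself)
will print each token together with its position increment:

import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.*;

public static void showTokens(Analyzer analyzer, String text) throws IOException {
  TokenStream stream = analyzer.tokenStream("field", new StringReader(text));
  for (Token token = stream.next(); token != null; token = stream.next()) {
    System.out.println(token.termText()
        + " (position increment: " + token.getPositionIncrement() + ")");
  }
}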

Otis

--- Peter Posselt Vestergaard [EMAIL PROTECTED] wrote:

 Hi
 I am building an index of texts, each related to a unique id. The
 unique ids
 might contain a number of underscores which will make the
 standardanalyzer
 shorten them after it sees the second underscore in a row.
 Furthermore many
 of the texts I am indexing is in Italian so the removal of 'trivial'
 words
 done by the standard analyzer is not necessarily meaningful for these
 texts.
 Therefore I am instead using an analyzer made from the
 WhitespaceTokenizer
 and the LowerCaseFilter.
 This works fine for me until I try searching for a phrase. I am
 searching
 for a simple phrase containing two words and with double-quotes
 around it. I
 have found the phrase in one of the texts so I know it should return
 at
 least one result, but none is found. If I remove the double-quotes
 and
 searches for the 2 words with AND between them I do find the story.
 Can anyone tell me if this is an obvious (side-)effect of not using
 the
 standard analyzer? And is there a better solution to my problem than
 using
 the very simple analyzer?
 Best regards
 Peter Vestergaard
 PS: I use the same analyzer for both searching and indexing (of
 course).
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Queries difference

2004-12-20 Thread Otis Gospodnetic
Alex, I think you want this:

+city:London +city:Amsterdam +address:1_street +address:2_street

Otis


--- Alex Kiselevski [EMAIL PROTECTED] wrote:

 
 Thanks Morus
 So if I understand right
 If the seqond query is :
 +city(London) +city(Amsterdam) +address(1_street)  +address(2_street)
 
 Both queries have the same value ?
 -Original Message-
 From: Morus Walter [mailto:[EMAIL PROTECTED]
 Sent: Monday, December 20, 2004 6:11 PM
 To: Lucene Users List
 Subject: Re: Queries difference
 
 
 Alex Kiselevski writes:
 
  Hello, I want to know is there a difference between queries:
 
  +city(+London Amsterdam) +address(1_street 2_street)
 
  And
 
  +city(+London) +city(Amsterdam) +address(1_street) 
 +address(2_street)
 
 I guess you mean city:(... and so on.
 
 The first query searches documents containing 'London' in city,
 scoring
 results also containing Amsterdam higher, and containing 1_street or
 2_street in address. The second query searches for documents
 containing
 both London and Amsterdam in city and 1_street and 2_street in
 address.
 Note the the + before London in the second query doesn't mean
 anything.
 
 HTH
   Morus
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 
 The information contained in this message is proprietary of Amdocs,
 protected from disclosure, and may be privileged.
 The information is intended to be conveyed only to the designated
 recipient(s)
 of the message. If the reader of this message is not the intended
 recipient,
 you are hereby notified that any dissemination, use, distribution or
 copying of
 this communication is strictly prohibited and may be unlawful.
 If you have received this communication in error, please notify us
 immediately
 by replying to the message and deleting it from your computer.
 Thank you.
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Indexing with Lucene 1.4.3

2004-12-17 Thread Otis Gospodnetic
The only place where you have to specify that you are using the
compound index format is on the IndexWriter instance.  Nothing needs to be
done at search time on IndexSearcher.
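For example (a sketch -- the compound format is already the default in
1.4):

IndexWriter writer = new IndexWriter(indexDir, analyzer, false);
writer.setUseCompoundFile(true);   // write each segment as a single .cfs file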

Otis

--- Hetan Shah [EMAIL PROTECTED] wrote:

 Thanks Chuck,
 
 I now understand why I see only one file. Another question is do I
 have 
 to specify somewhere in my code or some configuration setting that I 
 would now be using a compound file format (.cfs file) for index. I
 have 
 an application that was working in version 1.3-final till I moved to 
 1.4.3 now I do not get any results back from my searches.
 
 I tried using Luke and it shows me the content of the index. I can 
 search using Luke but no success so far with my own application.
 
 Any pointers?
 
 Thanks.
 -H
 
 Chuck Williams wrote:
 
 That looks right to me, assuming you have done an optimize.  All of
 your
 index segments are merged into the one .cfs file (which is large,
 right?).  Try searching -- it should work.
 
 Chuck
 
-Original Message-
From: Hetan Shah [mailto:[EMAIL PROTECTED]
Sent: Thursday, December 16, 2004 11:00 AM
To: Lucene Users List
Subject: Indexing with Lucene 1.4.3

Hello,

I have been trying to index around 6000 documents using
 IndexHTML
 from
1.4.3 and at the end of indexing in my index directory I only
 have 3
files.
segments
deletable and
_5en.cfs

Can someone tell me what is going on and where are the actual
 index
files? How can I resolve this issue?
Thanks.
-H


   

-
To unsubscribe, e-mail:
 [EMAIL PROTECTED]
For additional commands, e-mail:
 [EMAIL PROTECTED]
 
 

-
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 
 
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Disk space needed for indexing???

2004-12-16 Thread Otis Gospodnetic
The exact disk space usage depends on the number of fields in the index
and on how many of them store the original text.  You should also keep
in mind that the call to IndexWriter's optimize() will result in your
index directory size doubling while the optimization is in progress, so
if you want to optimize you will need extra free disk space.

Otis


--- [EMAIL PROTECTED] wrote:

 
 
 Hi, everyone,
 
 Does anyone have any idea how much disk space will be needed for
 generating the final index with ~1.5G size, for example?
 
 I have ~3.5G disk space and is able to generate index with 1G size.
 However, after I add more records, it will run out of disk space.
 Does
 Lucene suppose to take so much disk space for indexing? Is there any
 way
 that I can improve the code to let it take less space?
 
 
 Thanks,
 Ying
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Why does the StandardTokenizer split hyphenated words?

2004-12-16 Thread Otis Gospodnetic
Hello,

As Erik already said - that Analyzer is really there to get people
going quickly and as a 'does pretty good' Analyzer.  There is no
Analyzer that will work for everyone, and Analyzers are meant to be
custom-made.  It looks like you already got that figured out and have
your own Analyzer.

Otis

--- Mike Snare [EMAIL PROTECTED] wrote:

 Absolutely, but -- correct me if I'm wrong -- it would give no higher
 ranking to half-baked and would take a good deal longer on large
 indices.
 
 
 On Thu, 16 Dec 2004 20:03:27 +0100, Daniel Naber
 [EMAIL PROTECTED] wrote:
  On Thursday 16 December 2004 13:46, Mike Snare wrote:
  
Maybe for a-b, but what about English words like
 half-baked?
  
   Perhaps that's the difference in thinking, then.  I would imagine
 that
   you would want to search on half-baked and not half AND
 baked.
  
  A search for half-baked will find both half-baked and half baked
 (the
  phrase). The only thing you'll not find if halfbaked.
  
  Regards
   Daniel
  
  --
  http://www.danielnaber.de
  
 
 -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail:
 [EMAIL PROTECTED]
  
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Indexing a large number of DB records

2004-12-15 Thread Otis Gospodnetic
Hello Homam,

The batches I was referring to were batches of DB rows.
Instead of SELECT * FROM table... do SELECT * FROM table ... OFFSET=X
LIMIT=Y.

Don't close IndexWriter - use the single instance.

There is no MakeStable()-like method in Lucene, but you can control the
number of in-memory Documents, the frequency of segment merges, and the
maximal size of index segments with 3 IndexWriter parameters,
described fairly verbosely in the javadocs.
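A rough Java sketch of the batched approach (the table, columns, batch
size and the LIMIT/OFFSET paging syntax are all made up -- adjust them
to your database; the dotLucene calls should map one-to-one):

import java.sql.*;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public static void indexTable(Connection conn, String indexDir) throws Exception {
  IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), true);
  writer.minMergeDocs = 1000;        // buffer more documents in RAM per segment
  Statement stmt = conn.createStatement();
  final int batchSize = 1000;
  int offset = 0;
  while (true) {
    ResultSet rs = stmt.executeQuery(
        "SELECT name, price FROM products ORDER BY id"
        + " LIMIT " + batchSize + " OFFSET " + offset);
    int rows = 0;
    while (rs.next()) {
      Document doc = new Document();
      doc.add(Field.UnStored("name", rs.getString("name")));
      doc.add(Field.UnStored("price", rs.getString("price")));
      writer.addDocument(doc);
      rows++;
    }
    rs.close();
    if (rows < batchSize) break;     // last batch
    offset += batchSize;
  }
  writer.optimize();                 // only once, at the very end
  writer.close();
}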

Since you are using the .Net version, you should really consult
dotLucene guy(s).  Running under the profiler should also tell you
where the time and memory go.

Otis

--- Homam S.A. [EMAIL PROTECTED] wrote:

 Thanks Otis!
 
 What do you mean by building it in batches? Does it
 mean I should close the IndexWriter every 1000 rows
 and reopen it? Does that releases references to the
 document objects so that they can be
 garbage-collected?
 
 I'm calling optimize() only at the end.
 
 I agree that 1500 documents is very small. I'm
 building the index on a PC with 512 megs, and the
 indexing process is quickly gobbling up around 400
 megs when I index around 1800 documents and the whole
 machine is grinding to a virtual halt. I'm using the
 latest DotLucene .NET port, so may be there's a memory
 leak in it.
 
 I have experience with AltaVista search (acquired by
 FastSearch), and I used to call MakeStable() every
 20,000 documents to flush memory structures to disk.
 There doesn't seem to be an equivalent in Lucene.
 
 -- Homam
 
 
 
 
 
 
 --- Otis Gospodnetic [EMAIL PROTECTED]
 wrote:
 
  Hello,
  
  There are a few things you can do:
  
  1) Don't just pull all rows from the DB at once.  Do
  that in batches.
  
  2) If you can get a Reader from your SqlDataReader,
  consider this:
 

http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/document/Field.html#Text(java.lang.String,%20java.io.Reader)
  
  3) Give the JVM more memory to play with by using
  -Xms and -Xmx JVM
  parameters
  
  4) See IndexWriter's minMergeDocs parameter.
  
  5) Are you calling optimize() at some point by any
  chance?  Leave that
  call for the end.
  
  1500 documents with 30 columns of short
  String/number values is not a
  lot.  You may be doing something else not Lucene
  related that's slowing
  things down.
  
  Otis
  
  
  --- Homam S.A. [EMAIL PROTECTED] wrote:
  
   I'm trying to index a large number of records from
  the
   DB (a few millions). Each record will be stored as
  a
   document with about 30 fields, most of them are
   UnStored and represent small strings or numbers.
  No
   huge DB Text fields.
   
   But I'm running out of memory very fast, and the
   indexing is slowing down to a crawl once I hit
  around
   1500 records. The problem is each document is
  holding
   references to the string objects returned from
   ToString() on the DB field, and the IndexWriter is
   holding references to all these document objects
  in
   memory, so the garbage collector is getting a
  chance
   to clean these up.
   
   How do you guys go about indexing a large DB
  table?
   Here's a snippet of my code (this method is called
  for
   each record in the DB):
   
   private void IndexRow(SqlDataReader rdr,
  IndexWriter
   iw) {
 Document doc = new Document();
 for (int i = 0; i < BrowseFieldNames.Length; i++)
  {
 doc.Add(Field.UnStored(BrowseFieldNames[i],
   rdr.GetValue(i).ToString()));
 }
 iw.AddDocument(doc);
   }
   
   
   
   
 
   __ 
   Do you Yahoo!? 
   Yahoo! Mail - Find what you need with new enhanced
  search.
   http://info.mail.yahoo.com/mail_250
   
  
 
 -
   To unsubscribe, e-mail:
  [EMAIL PROTECTED]
   For additional commands, e-mail:
  [EMAIL PROTECTED]
   
   
  
  
 
 -
  To unsubscribe, e-mail:
  [EMAIL PROTECTED]
  For additional commands, e-mail:
  [EMAIL PROTECTED]
  
  
 
 
 
   
 __ 
 Do you Yahoo!? 
 Take Yahoo! Mail with you! Get it on your mobile phone. 
 http://mobile.yahoo.com/maildemo 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Indexing a large number of DB records

2004-12-15 Thread Otis Gospodnetic
Note that this really includes some extra steps.
You don't need a temp index.  Add everything to a single index using a
single IndexWriter instance.  No need to call addIndexes nor optimize
until the end.  Adding Documents to an index takes a constant amount of
time, regardless of the index size, because new segments are created as
documents are added, and existing segments don't need to be updated
(only when merges happen).  Again, I'd run your app under a profiler to
see where the time and memory are going.

Otis

--- Garrett Heaver [EMAIL PROTECTED] wrote:

 Hi Homan
 
 I had a similar problem as you in that I was indexing A LOT of data
 
 Essentially how I got round it was to batch the index.
 
 What I was doing was to add 10,000 documents to a temporary index,
 use
 addIndexes() to merge to temporary index into the live index (which
 also
 optimizes the live index) then delete the temporary index. On the
 next loop
 I'd only query rows from the db above the id in the maxdoc of the
 live index
 and set the max rows of the query to to 10,000
 i.e
 
 SELECT TOP 1 [fields] FROM [tables] WHERE [id_field] > {ID from
 Index.MaxDoc()} ORDER BY [id_field] ASC
 
 Ensuring that the documents go into the index sequentially your
 problem is
 solved and memory usage on mine (dotlucene 1.3) is low
 
 Regards
 Garrett
 
 -Original Message-
 From: Homam S.A. [mailto:[EMAIL PROTECTED] 
 Sent: 15 December 2004 02:43
 To: Lucene Users List
 Subject: Indexing a large number of DB records
 
 I'm trying to index a large number of records from the
 DB (a few millions). Each record will be stored as a
 document with about 30 fields, most of them are
 UnStored and represent small strings or numbers. No
 huge DB Text fields.
 
 But I'm running out of memory very fast, and the
 indexing is slowing down to a crawl once I hit around
 1500 records. The problem is each document is holding
 references to the string objects returned from
 ToString() on the DB field, and the IndexWriter is
 holding references to all these document objects in
 memory, so the garbage collector is getting a chance
 to clean these up.
 
 How do you guys go about indexing a large DB table?
 Here's a snippet of my code (this method is called for
 each record in the DB):
 
 private void IndexRow(SqlDataReader rdr, IndexWriter
 iw) {
   Document doc = new Document();
   for (int i = 0; i < BrowseFieldNames.Length; i++) {
   doc.Add(Field.UnStored(BrowseFieldNames[i],
 rdr.GetValue(i).ToString()));
   }
   iw.AddDocument(doc);
 }
 
 
 
 
   
 __ 
 Do you Yahoo!? 
 Yahoo! Mail - Find what you need with new enhanced search.
 http://info.mail.yahoo.com/mail_250
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: A question about scoring function in Lucene

2004-12-15 Thread Otis Gospodnetic
There is one case that I can think of where this 'constant' scoring
would be useful, and I think Chuck already mentioned this 1-2 months
ago.  For instance, having such scores would allow one to create alert
applications where queries run by some scheduler would trigger an alert
whenever the score is > X.  So that is where the absolute value of the
score would be useful.

I believe Chuck submitted some code that fixes this, which also helps
with MultiSearcher, where you have to have this contant score in order
to properly order hits from different Searchers, but I didn't dare to
touch that code without further studying, for which I didn't have time.

Otis


--- Doug Cutting [EMAIL PROTECTED] wrote:

 Chuck Williams wrote:
  I believe the biggest problem with Lucene's approach relative to
 the pure vector space model is that Lucene does not properly
 normalize.  The pure vector space model implements a cosine in the
 strictly positive sector of the coordinate space.  This is guaranteed
 intrinsically to be between 0 and 1, and produces scores that can be
 compared across distinct queries (i.e., 0.8 means something about
 the result quality independent of the query).
 
 I question whether such scores are more meaningful.  Yes, such scores
 
 would be guaranteed to be between zero and one, but would 0.8 really
 be 
 meaningful?  I don't think so.  Do you have pointers to research
 which 
 demonstrates this?  E.g., when such a scoring method is used, that 
 thresholding by score is useful across queries?
 
 Doug
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: finalize delete without optimize

2004-12-14 Thread Otis Gospodnetic
Hello John,

Once you make your change locally, use 'cvs diff -u IndexWriter.java >
indexwriter.patch' to make a patch.
Then open a new Bugzilla entry.
Finally, attach your patch to that entry.

Note that Document deletion is actually done from IndexReader, so your
patch may have to be on IndexReader, not IndexWriter.

Thanks,
Otis


--- John Wang [EMAIL PROTECTED] wrote:

 Hi Otis:
 
  Thanks for you reply.
 
  I am looking for more of an API call than a tool. e.g.
 IndexWriter.finalizeDelete()
 
  If I implement this, how would I go about submitting a patch?
 
 thanks
 
 -John
 
 
 On Mon, 13 Dec 2004 22:24:12 -0800 (PST), Otis Gospodnetic
 [EMAIL PROTECTED] wrote:
  Hello John,
  
  I believe you didn't get any replies to this.  What you are
 describing
  cannot be done using the public API, but maaay (no source code on this
  machine, so I can't double-check that) be doable if you use some of
 the
  'internal' methods.
  
  I don't have the need for this, but others might, so it may be
 worth
  developing a tool that purges Documents marked as deleted without
 the
  expensive segment merging, iff that is possible.  If you put this
 tool
  under the appropriate org.apache.lucene... package, you'll get
 access to
  'internal' methods, of course.  If you end up creating this, we
 could
  stick it in the Sandbox, where we should really create a new
 section
  for handy command-line tools that manipulate the index.
  
  Otis
  
  
  
  
  --- John Wang [EMAIL PROTECTED] wrote:
  
   Hi:
  
  Is there a way to finalize delete, e.g. actually remove them
 from
   the segments and make sure the docIDs are contiguous again.
  
  The only explicit way to do this is by calling
   IndexWriter.optimize(). But this call does a lot more (also merges
 all
   the segments), hence is very expensive. Is there a way to simply
 just
   finalize the deletes without having to merge all the segments?
  
   If not, I'd be glad to submit an implementation of this
 feature
   if
   the Lucene devs agree this is useful.
  
   Thanks
  
   -John
  
  
 -
   To unsubscribe, e-mail:
 [EMAIL PROTECTED]
   For additional commands, e-mail:
 [EMAIL PROTECTED]
  
  
  
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: TFIDF Implementation

2004-12-14 Thread Otis Gospodnetic
You can also see 'Books like this' example from here
https://secure.manning.com/catalog/view.php?book=hatcher2item=source

Otis

--- Bruce Ritchie [EMAIL PROTECTED] wrote:

 Christoph,
 
 I'm not entirely certain if this is what you want, but a while back
 David Spencer did code up a 'More Like This' class which can be used
 for generating similarities between documents. I can't seem to find
 this class in the sandbox so I've attached it here. Just repackage
 and test.
 
 
 Regards,
 
 Bruce Ritchie
 http://www.jivesoftware.com/   
 
  -Original Message-
  From: Christoph Kiefer [mailto:[EMAIL PROTECTED] 
  Sent: December 14, 2004 11:45 AM
  To: Lucene Users List
  Subject: TFIDF Implementation
  
  Hi,
  My current task/problem is the following: I need to implement 
  TFIDF document term ranking using Jakarta Lucene to compute a 
  similarity rank between arbitrary documents in the constructed
 index.
  I saw from the API that there are similar functions already 
  implemented in the class Similarity and DefaultSimilarity but 
  I don't know exactly how to use them. At the time my index 
  has about 25000 (small) documents and there are about 75000 
  terms stored in total.
   Now, my question is simple. Has anybody done this before,
   or could you point me to another location for help?
  
  Thanks for any help in advance.
  Christoph 
  
  --
  Christoph Kiefer
  
  Department of Informatics, University of Zurich
  
  Office: Uni Irchel 27-K-32
  Phone:  +41 (0) 44 / 635 67 26
  Email:  [EMAIL PROTECTED]
  Web:http://www.ifi.unizh.ch/ddis/christophkiefer.0.html
  
  
 
 -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail:
 [EMAIL PROTECTED]
  
  
 
 
-
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Opinions: Using Lucene as a thin database

2004-12-14 Thread Otis Gospodnetic
Well, one could always partition an index, distribute pieces of it
horizontally across multiple 'search servers' and use the built-in
RMI-based and Parallel search feature.  Nutch uses something similar
for search scaling.
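
A hedged sketch of what that can look like with the RMI support (host
names and binding names are placeholders; each search server would have
registered a RemoteSearchable over its slice of the index):

// look up one remote Searchable per index partition and query them in parallel
Searchable part1 = (Searchable) Naming.lookup("//search1/index-part1");
Searchable part2 = (Searchable) Naming.lookup("//search2/index-part2");
Searcher searcher = new ParallelMultiSearcher(new Searchable[] { part1, part2 });
Hits hits = searcher.search(
    QueryParser.parse("lucene", "contents", new StandardAnalyzer()));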

Otis


--- Monsur Hossain [EMAIL PROTECTED] wrote:

  My concern is that this just shifts the scaling issue to 
  Lucene, and I haven't found much info on how to scale Lucene 
  vertically.  
 
 By vertically, of course, I meant horizontally.  Basically
 scaling
 it across servers as one might do with a relational database.
 
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: finalize delete without optimize

2004-12-14 Thread Otis Gospodnetic
Hello John,

I believe you didn't get any replies to this.  What you are describing
cannot be done using the public API, but maaay (no source code on this
machine, so I can't double-check that) be doable if you use some of the
'internal' methods.  

I don't have the need for this, but others might, so it may be worth
developing a tool that purges Documents marked as deleted without the
expensive segment merging, iff that is possible.  If you put this tool
under the appropriate org.apache.lucene... package, you'll get access to
'internal' methods, of course.  If you end up creating this, we could
stick it in the Sandbox, where we should really create a new section
for handy command-line tools that manipulate the index.

Otis


--- John Wang [EMAIL PROTECTED] wrote:

 Hi:
 
Is there a way to finalize delete, e.g. actually remove them from
 the segments and make sure the docIDs are contiguous again.
 
The only explicit way to do this is by calling
 IndexWriter.optimize(). But this call does a lot more (also merges all
 the segments), hence is very expensive. Is there a way to simply just
 finalize the deletes without having to merge all the segments?
 
 If not, I'd be glad to submit an implementation of this feature
 if
 the Lucene devs agree this is useful.
 
 Thanks
 
 -John
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


RE: Opinions: Using Lucene as a thin database

2004-12-14 Thread Otis Gospodnetic
You can see Flickr-like tag (lookup) system at my Simpy site (
http://www.simpy.com ).  It uses Lucene as the backend for lookups, but
still uses a RDBMS as the primary storage.

I find that keeping the RDBMS and Lucene indices in sync is a bit of a pain
and error prone, so a _thin_ storage layer with simple requirements will
be okay with just using Lucene, while applications with more complex
domain models will quickly run into limitations (a 'using the wrong tool
for the job' type of problem).

Otis

--- Monsur Hossain [EMAIL PROTECTED] wrote:

 I think this is a great idea, and one that I've been mulling over to
 implement keyword lookups (similar to Flickr.com's tag system).  I
 believe the advantage over a relational database comes from Lucene's
 inverted index, which is highly optimized for this kind of lookup.  
 
 My concern is that this just shifts the scaling issue to Lucene, and
 I
 haven't found much info on how to scale Lucene vertically.  
 
 
 
 
  -Original Message-
  From: Kevin L. Cobb [mailto:[EMAIL PROTECTED] 
  Sent: Tuesday, December 14, 2004 9:40 AM
  To: [EMAIL PROTECTED]
  Subject: Opinions: Using Lucene as a thin database
  
  
  I use Lucene as a legitimate search engine which is cool. 
  But, I am also using it as a simple database too. I build an 
  index with a couple of keyword fields that allows me to 
  retrieve values based on exact matches in those fields. This 
  is all I need to do so it works just fine for my needs. I 
  also love the speed. The index is small enough that it is 
  wicked fast. Was wondering if anyone out there was doing the 
  same of it there are any dissenting opinions on using Lucene 
  for this purpose. 
  
   
  
   
  
   
  
  
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Indexing a large number of DB records

2004-12-14 Thread Otis Gospodnetic
Hello,

There are a few things you can do:

1) Don't just pull all rows from the DB at once.  Do that in batches.

2) If you can get a Reader from your SqlDataReader, consider this:
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/document/Field.html#Text(java.lang.String,%20java.io.Reader)

3) Give the JVM more memory to play with by using -Xms and -Xmx JVM
parameters

4) See IndexWriter's minMergeDocs parameter.

5) Are you calling optimize() at some point by any chance?  Leave that
call for the end.

1500 documents with 30 columns of short String/number values is not a
lot.  You may be doing something else not Lucene related that's slowing
things down.
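
To make points 1, 2 and 4 concrete, a sketch (the SQL, column names and
batch size are invented, and stmt/totalRows are assumed to exist):

writer.minMergeDocs = 100;   // buffer more Documents in RAM between segment merges

for (int offset = 0; offset < totalRows; offset += 1000) {
  // pull the table in batches instead of one huge result set
  ResultSet rs = stmt.executeQuery(
      "SELECT id, title, body FROM docs ORDER BY id LIMIT 1000 OFFSET " + offset);
  while (rs.next()) {
    Document doc = new Document();
    doc.add(Field.Keyword("id", rs.getString("id")));
    doc.add(Field.Text("title", rs.getString("title")));
    // a Reader-valued field avoids materializing the whole value as a String
    doc.add(Field.Text("body", rs.getCharacterStream("body")));
    writer.addDocument(doc);
  }
  rs.close();
}
writer.optimize();   // once, at the very end
writer.close();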

Otis


--- Homam S.A. [EMAIL PROTECTED] wrote:

 I'm trying to index a large number of records from the
 DB (a few millions). Each record will be stored as a
 document with about 30 fields, most of them are
 UnStored and represent small strings or numbers. No
 huge DB Text fields.
 
 But I'm running out of memory very fast, and the
 indexing is slowing down to a crawl once I hit around
 1500 records. The problem is each document is holding
 references to the string objects returned from
 ToString() on the DB field, and the IndexWriter is
 holding references to all these document objects in
 memory, so the garbage collector isn't getting a chance
 to clean these up.
 
 How do you guys go about indexing a large DB table?
 Here's a snippet of my code (this method is called for
 each record in the DB):
 
 private void IndexRow(SqlDataReader rdr, IndexWriter
 iw) {
   Document doc = new Document();
   for (int i = 0; i < BrowseFieldNames.Length; i++) {
   doc.Add(Field.UnStored(BrowseFieldNames[i],
 rdr.GetValue(i).ToString()));
   }
   iw.AddDocument(doc);
 }
 
 
 
 
   
 __ 
 Do you Yahoo!? 
 Yahoo! Mail - Find what you need with new enhanced search.
 http://info.mail.yahoo.com/mail_250
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Indexing HTML files give following message

2004-12-12 Thread Otis Gospodnetic
Hello,

This is probably due to some bad HTML.  The application you are using
is just a demo, and uses a JavaCC-based HTML parser, which may not be
resilient to invalid HTML.  For Lucene in Action we developed a little
extensible indexing framework, and for HTML indexing we used 2 tools to
handle HTML parsing: JTidy and NekoHTML.  Since the code for the book
is freely available... http://www.manning.com.  NekoHTML knows how
to deal with some bad HTML, that's why I'm suggesting this.
The indexing framework could come in handy for those working on various
'desktop search' applications (Roosster, LDesktop (if that's really
happening), Lucidity, etc.)
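
For example, a minimal NekoHTML-based text extractor might look roughly
like this (not the book's actual code; the field names are arbitrary):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.cyberneko.html.parsers.DOMParser;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import java.io.FileReader;

public class NekoHtmlDocument {
  public static Document parse(String path) throws Exception {
    DOMParser parser = new DOMParser();           // tolerant of sloppy HTML
    parser.parse(new InputSource(new FileReader(path)));
    StringBuffer text = new StringBuffer();
    collectText(parser.getDocument(), text);
    Document doc = new Document();
    doc.add(Field.UnIndexed("path", path));
    doc.add(Field.Text("contents", text.toString()));
    return doc;
  }
  // walk the DOM and concatenate all text nodes
  private static void collectText(Node node, StringBuffer buf) {
    if (node.getNodeType() == Node.TEXT_NODE) {
      buf.append(node.getNodeValue()).append(' ');
    }
    for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
      collectText(c, buf);
    }
  }
}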

Otis


--- Hetan Shah [EMAIL PROTECTED] wrote:

 java org.apache.lucene.demo.IndexHTML -create -index
 /source/workarea/hs152827/newIndex ..
 adding ../0/10037.html
 adding ../0/10050.html
 adding ../0/1006132.html
 adding ../0/1013223.html
 Parse Aborted: Encountered \ at line 5, column 1.
 Was expecting one of:
 ArgName ...
 = ...
 TagEnd ...
 
 And then the indexing hangs on this line. Earlier it used to go on
 and
 index remaining pages in the directory. Any idea why would the
 indexer
 stop at this error.
 
 Pointers are much needed and appreciated.
 -H
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Finding unused segment files?

2004-12-12 Thread Otis Gospodnetic
Hello George,

Here is a quick hack (with a few TODOs).  I only tested it a bit, so
the actual delete calls are still commented out.  If this works for
you, and especially if you take care of TODOs, I may put this in the
Lucene Sandbox.

Otis
P.S.
Usage example showing how the tool found some unused segments (this was
caused by a bug in one of the earlier 1.4 versions of Lucene).

[EMAIL PROTECTED] java]$ java org.apache.lucene.index.SegmentPurger
/simpy/users/1/index
Candidate non-Lucene file found: _1b2.del
Candidate unused Lucene file found: _1b2.cfs
Candidate unused Lucene file found: _1bm.cfs
Candidate unused Lucene file found: _1c6.cfs
Candidate unused Lucene file found: _1cq.cfs
Candidate unused Lucene file found: _1da.cfs
Candidate unused Lucene file found: _1du.cfs
Candidate unused Lucene file found: _1ee.cfs
Candidate unused Lucene file found: _1ey.cfs
[EMAIL PROTECTED] java]$
[EMAIL PROTECTED] java]$ strings /simpy/users/1/index/segments
_3o0
[EMAIL PROTECTED] java]$ ls -al /simpy/users/1/index/
total 647
drwxrwsr-x2 otis simpy1024 Dec  7 14:39 .
drwxrwsr-x3 otis simpy1024 Sep 16 20:39 ..
-rw-rw-r--1 otis simpy  212815 Nov 17 18:36 _1b2.cfs
-rw-rw-r--1 otis simpy 104 Nov 17 18:40 _1b2.del
-rw-rw-r--1 otis simpy3380 Nov 17 18:40 _1bm.cfs
-rw-rw-r--1 otis simpy3533 Nov 17 18:40 _1c6.cfs
-rw-rw-r--1 otis simpy4774 Nov 17 18:40 _1cq.cfs
-rw-rw-r--1 otis simpy3389 Nov 17 18:40 _1da.cfs
-rw-rw-r--1 otis simpy3809 Nov 17 18:40 _1du.cfs
-rw-rw-r--1 otis simpy3423 Nov 17 18:40 _1ee.cfs
-rw-rw-r--1 otis simpy4016 Nov 17 18:40 _1ey.cfs
-rw-rw-r--1 otis simpy  410299 Dec  7 14:39 _3o0.cfs
-rw-rw-r--1 otis simpy   4 Dec  7 14:39 deletable
-rw-rw-r--1 otis simpy  29 Dec  7 14:39 segments


--- [EMAIL PROTECTED] wrote:

 Hello all.
 
  
 
 I recently ran into a problem where errors during indexing or
 optimization
 (perhaps related to running out of disk space) left me with a working
 index
 in a directory but with additional segment files (partial) that were
 unneeded.  The solution for finding the ~40 files to keep out of the
 ~900
 files in the directory amounted to dumping the segments file and
 noting that
 only 5 segments were in fact live.  The index is a non-compound
 index
 using FSDirectory.
 
  
 
 Is there (or would it be possible to add (and I'd be willing to
 submit code
 if it made sense to people)) some sort of interrogation on the index
 of what
 files belonged to it?  I looked first at FSDirectory itself thinking
 that
 its list() method should return the subset of index-related files
 but
 looking deeper it looks like Directory is at a lower level
 abstracting
 simple I/O and thus wouldn't know.
 
  
 
 So any thoughts?  Would it make sense to have a form of clean on
 IndexWriter()?  I hesitate since it seems there isn't a charter that
 only
 Lucene files could exist in the directory thus what is ideal for my
 application (since I know I won't mingle other files) might not be
 ideal for
 all.  Would it be fair to look for Lucene known extensions and file
 naming
 signatures to identify unused files that might be failed or dead
 segments?
 
  
 
 Thanks,
 
 -George
 
 package org.apache.lucene.index;

import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.FSDirectory;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Iterator;
import java.io.File;


/**
 * A tool that peeks into Lucene index directories and removes
 * unwanted files.  In its more radical mode, this tool can be used to
 * remove all non-Lucene index files from a directory.  The other
 * option is to remove unused Lucene segment files, should the index
 * directory get polluted.
 *
 * TODO: this tool should really lock the directory for writing before
 * removing any Lucene segment files, otherwise this tool itself may
 * corrupt the index.
 *
 * @author Otis Gospodnetic
 * @version $Id$
 */
public class SegmentPurger
{
// TODO: copied from SegmentMerger - should probably made public
// static final, to make it reusable
// TODO: add .del extension

// File extensions of old-style index files
public static final String MULTIFILE_EXTENSIONS[] = new String[] {
    "fnm", "frq", "prx", "fdx", "fdt", "tii", "tis"
};
public static final String VECTOR_EXTENSIONS[] = new String[] {
    "tvx", "tvd", "tvf"
};
public static final String COMPOUNDFILE_EXTENSIONS[] = new String[] {
    "cfs"
};
public static final String INDEX_FILES[] = new String[] {
    "segments", "deletable"
};

public static final String[][] SEGMENT_EXTENSIONS = new String[][] {
MULTIFILE_EXTENSIONS, COMPOUNDFILE_EXTENSIONS, VECTOR_EXTENSIONS
};

/** The file format version, a negative number. */
/* Works since counter, the old

RE: OutOfMemoryError with Lucene 1.4 final

2004-12-10 Thread Otis Gospodnetic
Ying,

You should follow this finally block advice below.  In addition, I
think you can just close the reader, and it will close the underlying
stream (I'm not sure about that, double-check it).

You are not running out of file handles, though.  Your JVM is running
out of memory.  You can play with:

1) -Xms and -Xmx JVM command-line parameters
2) IndexWriter's parameters: mergeFactor and minMergeDocs - check the
Javadocs for more info.  They will let you control how much memory your
indexing process uses.
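
Something along these lines, for instance (paths and numbers are arbitrary,
and the JVM would be started with e.g. -Xms128m -Xmx512m):

IndexWriter writer = null;
try {
  writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), true);
  writer.mergeFactor = 10;     // how many segments pile up before a merge
  writer.minMergeDocs = 100;   // how many Documents are buffered in RAM
  // ... addDocument() loop ...
} finally {
  if (writer != null) {
    try { writer.close(); } catch (IOException ignore) { /*NOOP*/ }
  }
}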

Otis


--- Sildy Augustine [EMAIL PROTECTED] wrote:

 I think you should close your files in a finally clause in case of
 exceptions with file system and also print out the exception. 
 
 You could be running out of file handles.
 
 -Original Message-
 From: Jin, Ying [mailto:[EMAIL PROTECTED] 
 Sent: Friday, December 10, 2004 11:15 AM
 To: [EMAIL PROTECTED]
 Subject: OutOfMemoryError with Lucene 1.4 final
 
 Hi, Everyone,
 
  
 
 We're trying to index ~1500 archives but get OutOfMemoryError about
 halfway through the index process. I've tried to run program under
 two
 different Redhat Linux servers: One with 256M memory and 365M swap
 space. The other one with 512M memory and 1G swap space. However,
 both
 got OutOfMemoryError at the same place (at record 898). 
 
  
 
 Here is my code for indexing:
 
 ===
 
 Document doc = new Document();
 
doc.add(Field.UnIndexed("path", f.getPath()));

doc.add(Field.Keyword("modified",
        DateField.timeToString(f.lastModified())));

doc.add(Field.UnIndexed("eprintid", id));

doc.add(Field.Text("metadata", metadata));
 
  
 
 FileInputStream is = new FileInputStream(f);  // the text file
 
 BufferedReader reader = new BufferedReader(new
 InputStreamReader(is));
 
  
 
 StringBuffer stringBuffer = new StringBuffer();
 
String line = "";
 
 try{
 
   while((line = reader.readLine()) != null){
 
 stringBuffer.append(line);
 
   }
 
  doc.add(Field.Text("contents", stringBuffer.toString()));
 
   // release the resources
 
   is.close();
 
   reader.close();
 
 }catch(java.io.IOException e){}
 
 =
 
Is there anything wrong with my code, or do I need more memory?
 
  
 
 Thanks for any help!
 
 Ying
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: maxDoc()

2004-12-09 Thread Otis Gospodnetic
Hello Garrett,

Share some code, it will be easier for others to help you that way. 
Obviously, this would be a huge bug if the problem were within Lucene.

Otis

--- Garrett Heaver [EMAIL PROTECTED] wrote:

 Can anyone please explain to me why maxDoc returns 0 when Luke shows
 239,473
 documents?
 
  
 
 maxDoc returns the correct number until I delete a document. And I
 have
 called optimize after the delete but still the problem remains
 
  
 
 Strange.
 
  
 
 Any ideas greatly appreciated
 
 Garrett
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: problem restoring index

2004-12-08 Thread Otis Gospodnetic
There is no need to reindex.  However, I also don't quite get what the
problem is :)

Otis

--- Santosh [EMAIL PROTECTED] wrote:

 hi,
 
 when I restart Tomcat, the index is getting corrupted. If I take
 a backup of the index and then restart Tomcat, the index is not
 working properly.
 
 Do I have to Index again all the documents whenever I restart the
 Tomcat?
 
 
 
 
 ---SOFTPRO
 DISCLAIMER--
 
 Information contained in this E-MAIL and any attachments are
 confidential being  proprietary to SOFTPRO SYSTEMS  is 'privileged'
 and 'confidential'.
 
 If you are not an intended or authorised recipient of this E-MAIL or
 have received it in error, You are notified that any use, copying or
 dissemination  of the information contained in this E-MAIL in any
 manner whatsoever is strictly prohibited. Please delete it
 immediately
 and notify the sender by E-MAIL.
 
 In such a case reading, reproducing, printing or further
 dissemination
 of this E-MAIL is strictly prohibited and may be unlawful.
 
 SOFTPRO SYSYTEMS does not REPRESENT or WARRANT that an attachment
 hereto is free from computer viruses or other defects.
 
 The opinions expressed in this E-MAIL and any ATTACHEMENTS may be
 those of the author and are not necessarily those of SOFTPRO SYSTEMS.


 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: searchig with special characters

2004-12-08 Thread Otis Gospodnetic
Leading wildcard character (*) is not allowed if you use QueryParser
that comes with Lucene.  Reason: performance.  See many discussions
about this on the lucene-user mailing list.  Also see the search syntax
document on the Lucene site.  What other characters are you having
trouble with?
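
If you can't control what users type, a rough way to cope is to catch the
ParseException and retry with the special characters stripped (the
character list below is illustrative, not exhaustive):

Query query;
try {
  query = QueryParser.parse(userInput, "contents", new StandardAnalyzer());
} catch (ParseException e) {
  // drop characters QueryParser treats specially and try once more
  String cleaned = userInput.replaceAll("[*?~^:!(){}\\[\\]\"+-]", " ");
  query = QueryParser.parse(cleaned, "contents", new StandardAnalyzer());
}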

Otis


--- Santosh [EMAIL PROTECTED] wrote:

 whenever I search with some special characters like *world I am
 getting an exception. How can I avoid this? And for what other
 characters does Lucene give this type of exception?
 
 
 ---SOFTPRO
 DISCLAIMER--
 
 Information contained in this E-MAIL and any attachments are
 confidential being  proprietary to SOFTPRO SYSTEMS  is 'privileged'
 and 'confidential'.
 
 If you are not an intended or authorised recipient of this E-MAIL or
 have received it in error, You are notified that any use, copying or
 dissemination  of the information contained in this E-MAIL in any
 manner whatsoever is strictly prohibited. Please delete it
 immediately
 and notify the sender by E-MAIL.
 
 In such a case reading, reproducing, printing or further
 dissemination
 of this E-MAIL is strictly prohibited and may be unlawful.
 
 SOFTPRO SYSYTEMS does not REPRESENT or WARRANT that an attachment
 hereto is free from computer viruses or other defects.
 
 The opinions expressed in this E-MAIL and any ATTACHEMENTS may be
 those of the author and are not necessarily those of SOFTPRO SYSTEMS.


 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Empty/non-empty field indexing question

2004-12-08 Thread Otis Gospodnetic
Correct.
No, there is no point in putting an empty field there.
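
In code that just means guarding the add, e.g. (field name invented):

String title = rs.getString("title");    // may be null or empty for this row
if (title != null && title.length() > 0) {
  doc.add(Field.Text("title", title));   // only add the field when there is a value
}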

Otis

--- [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:

 Hi Otis
 
 What kind of implications does that produce on the search?
 
 If I understand correctly that record would not be searched for if
 the 
 field is not there, correct?
 But then is there a point putting an empty value in it, if an 
 application will never search for empty values?
 
 
 thanks
 
 -pedja
 
 
 Otis Gospodnetic said the following on 12/8/2004 1:31 AM:
 
 Empty fields won't add any value, you can skip them.  Documents in
 an
 index don't have to be uniform.  Each Document could have a
 different
 set of fields.  Of course, that has some obvious implications for
 search, but is perfectly fine technically.
 
 Otis
 
 --- [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
 
   
 
 Here's probably a silly question, very newbish, but I had to ask.
 Since I have mysql documents that contain over 30 fields each and
 most of them
 are added to the index, is it a common practice to add fields to
 the
 index with 
 empty values, for that particular record, or should the field be
 totally omitted.
 
 What I mean is if let's say a Title field is empty on a specific
 record (in mysql)
 should I still add that field into Lucene index with an empty value
 or just
 skip it and only add the fields that contain non-empty values?
 
 thanks
 
 -pedja
 
 
 
 

-
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail:
 [EMAIL PROTECTED]
 
 
 
 
 
 

-
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 
 
   
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: 'IN' type search

2004-12-08 Thread Otis Gospodnetic
Hello,

You can use BooleanQuery for that.
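
Roughly: one optional TermQuery clause per value in the list, so a document
matches if any of them matches (field and values are made up):

BooleanQuery in = new BooleanQuery();
String[] values = { "red", "green", "blue" };
for (int i = 0; i < values.length; i++) {
  // required=false, prohibited=false => an optional (OR-ed) clause
  in.add(new TermQuery(new Term("color", values[i])), false, false);
}
Hits hits = searcher.search(in);

Just keep an eye on the maximum clause count if the value list gets long.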

Otis

--- Ravi [EMAIL PROTECTED] wrote:

  
 Hi
  How do you get all documents in lucene where a particular field
 value
 is in a given list of values (like SQL IN). What kind of Query class
 should I use?
 
 Thanks in advance.
 Ravi.
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: When is the book released?

2004-12-07 Thread Otis Gospodnetic
Hello,

Yes, Lucene in Action has been listed on Amazon for a while now (I
think I recorded this in my blog some time back).  The publish date is,
I believe, the date provided by publishers, but things almost always
take longer than predicted, so 31.12.2004 may be a bit off. :( 
However, the ebook should be out any time now, as Erik already
mentioned.  It's cheaper, saves trees, and doesn't consume precious
horizontal surfaces in your home (I live in New York City, where large
living spaces are hard to find unless you live in a former warehouse or
pay big money).

Otis


just like lots of software 

--- Palmer, Andrew MMI Woking [EMAIL PROTECTED] wrote:

 
 
 I have just had a quick look at both the US and UK version of Amazon
 and
 they both list the book as Lucene In Action.
 
 
 I was curious as I work for the UK Bibliographic agency and it was on
 our database and should have been on Amazon for at least a couple of
 weeks.  The agency has known about the book since start of September.
 
 
 It has a publication date of 31/12/2004.
 
 Andrew
 
 
 
 -Original Message-
 From: Erik Hatcher [mailto:[EMAIL PROTECTED] 
 Sent: 07 December 2004 12:45
 To: Lucene Users List
 Subject: Re: When is the book released?
 
 
 On Dec 7, 2004, at 5:27 AM, Aad Nales wrote:
  Sorry if this is a misspost but I have been visiting Amazon daily
 the
  last few weeks and I still can't get the Lucene book there. How
 will I
  survive the holidays? :-)
 
  But seriously when can we expect the release?
 
 Manning will have the electronic book version available *TODAY* 
 (hopefully).  It has been sent to the printers and this process takes
 a 
 few weeks.  I don't expect Amazon.com to be shipping them until
 January 
 though - the book industry really is slow moving.
 
 Otis and I thank everyone for their patience and you can be sure that
 
 no one wants the book in their hands more than he and I :)
 
   Erik
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: QueryFilter vs CachingWrapperFilter vs RangeQuery

2004-12-07 Thread Otis Gospodnetic
If you run the same query again, the IndexSearcher will go all the way
to the index again - no caching.  Some caching will be done by your
file system, possibly, but that's it.  Lucene is fast, so don't
optimize early.
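
If you do want Lucene-side caching of a restriction you reuse a lot, a
sketch (the "type" field is invented) is to wrap a QueryFilter and hold on
to both the filter and the IndexSearcher between requests:

// built once and kept around, e.g. in a field of your search component
Filter articlesOnly = new CachingWrapperFilter(
    new QueryFilter(new TermQuery(new Term("type", "article"))));

// every request reuses the same searcher and filter, so bits() is computed once
Hits hits = searcher.search(userQuery, articlesOnly);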

Otis


--- Ben Rooney [EMAIL PROTECTED] wrote:

 thanks chris,
 
 you are correct that i'm not sure if i need the caching ability.  it
 is
 more to understand right now so that if we do need to implement it, i
 am
 able to.
 
 the reason for the caching is that we will have listing pages for
 certain content types.  for example a listing page of articles.  this
 listing will be generated against lucene engine using a basic query.
 the page will also have the ability to filter the articles based on
 date
 range as one example.  so caching those results could be beneficial.
 
 however, we will also potentially want to cache the basic query so
 that
 subsequent queries will hit a cache.  when new content is published
 or
 content is removed from the site, the caches will need to be
 invalidated
 so new results are created.
 
 for the basic query, is there any caching mechanism built into the
 SearchIndexer or do we need to build our own caching mechanism?
 
 thanks
 ben
 
 On Tue, 2004-07-12 at 12:29 -0800, Chris Hostetter wrote:
 
  :  executes the search, i would keep a static reference to
 SearchIndexer
  :  and then when i want to invalidate the cache, set it to null or
 create
  
  : design of your system.  But, yes, you do need to keep a reference
 to it
  : for the cache to work properly.  If you use a new IndexSearcher
  : instance (I'm simplifying here, you could have an IndexReader
 instance
  : yourself too, but I'm ignoring that possibility) then the
 filtering
  : process occurs for each search rather than using the cache.
  
  Assuming you have a finite number of Filters, and assuming those
 Filters
  are expensive enough to be worth it...
  
  Another approach you can take to share the cache among multiple
  IndexReaders is to explicitly call the bits method on your
 filter(s) once,
  and then cache the resulting BitSet anywhere you want (ie:
 serialize it to
  disk if you so choose).  and then impliment a BitsFilter class
 that you
  can construct directly from a BitSet regardless of the IndexReader.
  The
  down side of this approach is that it will *ONLY* work if you
 arecertain
  that the index is never being modified.  If any documents get
 added, or
  the index gets re-optimized you must regenerate all of the BitSets.
  
  (That's why the CachingWrapperFilter's cache is keyed off of hte
  IndexReader ... as long as you're re-using the same IndexReader, it
 know's
  that the cached BitSet must still be valid, because an IndexReader
  allways sees the same index as when it was opened, even if another
  thread/process modifies it.)
  
  
  class BitsFilter {
 BitSet bits;
 public BitsFilter(BitSet bits) {
   this.bits=bits;
 }
  public BitSet bits(IndexReader r) {
    return (BitSet) bits.clone();
 }
  }
  
  
  
  
  -Hoss
  
  
 
 -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail:
 [EMAIL PROTECTED]
  
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Empty/non-empty field indexing question

2004-12-07 Thread Otis Gospodnetic
Empty fields won't add any value, you can skip them.  Documents in an
index don't have to be uniform.  Each Document could have a different
set of fields.  Of course, that has some obvious implications for
search, but is perfectly fine technically.

Otis

--- [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:

 Here's probably a silly question, very newbish, but I had to ask.
 Since I have mysql documents that contain over 30 fields each and
 most of them
 are added to the index, is it a common practice to add fields to the
 index with 
 empty values, for that particular record, or should the field be
 totally omitted.
 
 What I mean is if let's say a Title field is empty on a specific
 record (in mysql)
 should I still add that field into Lucene index with an empty value
 or just
 skip it and only add the fields that contain non-empty values?
 
 thanks
 
 -pedja
 
 
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: addIndexes() Size

2004-12-06 Thread Otis Gospodnetic
If I were you, I would first use Luke to peek at the index.  You may
find something obvious there, like multiple copies of the same
Document.
Does your temp index 'overlap' with A index in terms of Documents?  If
so, you will end up with multliple copies, as addIndexes method doesn't
detect and remove duplicate Documents.

Otis

--- Garrett Heaver [EMAIL PROTECTED] wrote:

 Hi.
 
  
 
 It's probably really simple to explain this but since I'm not up to
 speed on
 the way Lucene stores the data I'm a little confused.
 
  
 
 I'm building an Index, which resides on Server A, with the Lucene
 Service
 running on Server B. Now not to bore you with the details but because
 of the
 network transfer rate etc I'm running the actual index on
 \\ServerA\idx and building a temp Index at
 \\ServerB\idx\temp (obviously because the Local FS is much
 faster
 for the service) and then calling addIndexes to import the temp index
 to the
 ServerA index before destroying the ServerB index, holding for a bit
 and
 then checking for new documents.
 
  
 
 All works grand BUT the size of the resultant index on ServerA is
 HUGE in
 comparison to one I'd build from start to finish (i.e. a simple
 addDocument
 Index) - 38gig for 220,000 Unstored Items cannot be right (to give
 you and
 idea of how mad this seems, the backed up version of the database
 from which
 the data is pulled is only 2gigs)
 
  
 
 I've considered it being perhaps the number of Items that had to be
 integrated each time addIndexes was called - right now I'm adding
 around
 10,000 at a time (I had done 1000 at a time but this looked like it
 was
 going to end up even larger still)
 
  
 
 I'm holding off twiddling the minMergeDocs and mergeFactor until I
 can get a
 better understanding of whats going on here.
 
  
 
 Many thanks for any replies
 
 Garrett
 
  
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Index delete failing

2004-12-06 Thread Otis Gospodnetic
This smells like a Windows issue.  It is possible that something in
your JVM is still holding onto the index directory (for example,
FSDirectory), and Winblows is not letting you remove the directory.  I
bet this will work if you exit the JVM and run java.io.file.delete()
without calling Lucene.  Sorry, my Windows + Lucene experience is
limited.

Otis

--- Ravi [EMAIL PROTECTED] wrote:

  Hi 
  We need to delete a lucene index from our application using
 java.io.file.delete(). We are closing the indexWriter and even all
 the
 index searchers on that folder. But a call to delete returns false.
 There is no lock on the index directory. Interesting thing is that
 the
 deletable and segments files are getting removed. But the rest of the
 .cfs are not. Has somebody had similar problem? 
 
 Thanks in advance,
 Ravi. 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Single Digit Indexing

2004-12-06 Thread Otis Gospodnetic
Hm, if you can index 11, you should be able to index 8 as well.  In any
case, you most likely want to make sure that your Analyzer is not just
throwing your numbers out.  This may stillbe up to date:
http://www.jguru.com/faq/view.jsp?EID=538308

See also: http://wiki.apache.org/jakarta-lucene/HowTo
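
A quick way to check is to dump the tokens your analyzer produces for a
sample string; analyzers built on LetterTokenizer (SimpleAnalyzer,
StopAnalyzer) discard digits entirely, for instance.  A throwaway sketch:

Analyzer analyzer = new StandardAnalyzer();   // swap in the analyzer you index with
TokenStream stream = analyzer.tokenStream("contents",
    new StringReader("Gemini 8 Apollo 11"));
for (Token token = stream.next(); token != null; token = stream.next()) {
  System.out.println("[" + token.termText() + "]");
}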

Otis

--- Bill von Ofenheim (LaRC) [EMAIL PROTECTED] wrote:

 How can I get Lucene to index single digits (e.g. 8 as in Gemini
 8)?
 I am able to index numbers with two or more digits (e.g. 11 as in
 Apollo 11).
 
 Thanks,
 Bill von Ofenheim
 
 
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Is this a bug or a feature with addIndexes?

2004-12-06 Thread Otis Gospodnetic
Hello,

Try changing IndexWriter's mergeFactor variable.  It's 10 by default. 
Change it to 1, for instance.

Otis

--- [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:

 Greetings,
 
 Ok, so maybe this is common knowledge to most of you but I'm a layman 
 when it comes to Lucene and
 I couldnt find any details about this after some searching.
 
 When you merge two indexes via addIndexes, does it only work in
 batches 
 (10 or more documents)?
 
 Because I've been banging my head off the wall wondering why my code 
 does not want to index 1 (one) document and
 then I went to run Otis's MemoryVsDisk class from 
 http://www.onjava.com/pub/a/onjava/2003/03/05/lucene.html?page=last
 but I didn't use 10,000 documents as suggested, I used 5 and 15
 instead.
 And what do you know, with less than 10 it doesn't merge at all, while with more
 than 10 it will merge only the first 10 documents and
 gently forget about the other 5.
 
 My project requires me to index/update one single document as
 required 
 and make it immediately available for searching.
 
 How do I accomplish this if index merging will not merge less than 10
 
 and in increments of 10, and single indexing doesn't
 seem to do it at all (please see my other post 
 http://marc.theaimsgroup.com/?l=lucene-userm=110237364203877w=2)
 
 thanks
 
 -pedja
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: restricting search result

2004-12-03 Thread Otis Gospodnetic
This is entirely application-specific.  As the simplest approach, you
can index each user's documents in a separate index and use
(Parallel)MultiSearcher to search appropriate indices (which ones are
appropriate to search has to be a part of your app's access control
logic).
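
A bare-bones sketch of the multi-index route (paths are placeholders; which
paths end up in the array is exactly the access-control decision):

// open only the indices this user is allowed to see
String[] allowed = { "/indexes/public", "/indexes/group-42" };
Searchable[] searchables = new Searchable[allowed.length];
for (int i = 0; i < allowed.length; i++) {
  searchables[i] = new IndexSearcher(allowed[i]);
}
Searcher searcher = new MultiSearcher(searchables);
Hits hits = searcher.search(query);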

Otis


--- Paul [EMAIL PROTECTED] wrote:

 Hi,
 how would you restrict the search results for a certain user? I'm
 indexing all the existing data in my application but there are
 certain
 access levels so some users should see more results than others.
 Each lucene document has a field with an internal id and I want to
 restrict on that basis. I tried it with adding a long concatenation
 of
 my ids (+locationId:1 +locationId:3 + ...) but this throws a "More
 than 32 required/prohibited clauses in query." exception.
 Any suggestions?
 thx!
 Paul
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Recommended values for mergeFactor, minMergeDocs, maxMergeDocs

2004-12-03 Thread Otis Gospodnetic
In my experiments with mergeFactor I found the point of diminishing/no
returns.  If I remember correctly, I hit the limit at mergeFactor of
50.

But here is something from Lucene in Action that you can use to play
with various index tuning factors and see their effect on indexing
performance.  It's simple, and if you want to test all 3 of your
scenarios, you will have to modify it.

package lia.indexing;

import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

/**
 *
 */
public class IndexTuningDemo {

  public static void main(String[] args) throws Exception {
int docsInIndex  = Integer.parseInt(args[0]);

// create an index called 'index-dir' in a temp directory
Directory dir = FSDirectory.getDirectory(
  System.getProperty("java.io.tmpdir", "tmp") +
  System.getProperty("file.separator") + "index-dir", true);
Analyzer analyzer = new SimpleAnalyzer();
IndexWriter writer = new IndexWriter(dir, analyzer, true);

// set variables that affect speed of indexing
writer.mergeFactor   = Integer.parseInt(args[1]);
writer.maxMergeDocs  = Integer.parseInt(args[2]);
writer.minMergeDocs  = Integer.parseInt(args[3]);
writer.infoStream= System.out;

System.out.println("Merge factor:   " + writer.mergeFactor);
System.out.println("Max merge docs: " + writer.maxMergeDocs);
System.out.println("Min merge docs: " + writer.minMergeDocs);

long start = System.currentTimeMillis();
for (int i = 0; i  docsInIndex; i++) {
  Document doc = new Document();
  doc.add(Field.Text("fieldname", "Bibamus"));
  writer.addDocument(doc);
}
writer.close();
long stop = System.currentTimeMillis();
System.out.println("Time: " + (stop - start) + " ms");
  }
}


Otis


--- Chuck Williams [EMAIL PROTECTED] wrote:

 I'm wondering what values of mergeFactor, minMergeDocs and
 maxMergeDocs
 people have found to yield the best performance for different
 configurations.  Is there a repository of this information anywhere?
 
  
 
 I've got about 30k documents and have 3 indexing scenarios:
 
 1.   Full indexing and optimize
 
 2.   Incremental indexing and optimize
 
 3.   Parallel incremental indexing without optimize
 
  
 
 Search performance is critical.  For both cases 1 and 2, I'd like the
 fastest possible indexing time.  For case 3, I'd like minimal pauses
 and
 no noticeable degradation in search performance.
 
  
 
 Based on reading the code (including the javadocs comments), I'm
 thinking of values along these lines:
 
  
 
 mergeFactor:  1000 during Full indexing, and during optimize (for
 both
 cases 1 and 2); 10 during incremental indexing (cases 2 and 3)
 
 minMergeDocs:  1000 during Full indexing, 10 during incremental
 indexing
 
 maxMergeDocs:  Integer.MAX_VALUE during full indexing, 1000 during
 incremental indexing
 
  
 
 Do these values seem reasonable?  Are there better settings before I
 start experimenting?
 
  
 
 Since mergeFactor is used in both addDocument() and optimize(), I'm
 thinking of using two different values in case 2:  10 during the
 incremental indexing, and then 1000 during the optimize.  Is changing
 the value like this going to cause a problem?
 
 
 Thanks for any advice,
 
  
 
 Chuck
 
  
 
  
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Document-Map, Hits-List

2004-12-03 Thread Otis Gospodnetic
Yes, it's not wise to just pull all Document instances from a Hits
instance, unless you really need them all.  I don't do that, I really
just provide a wrapper, like this:

/**
 * A simple List implementation wrapping a Hits object.
 *
 * @author Otis Gospodnetic
 * @version $Id: HitList.java,v 1.4 2004/11/11 14:08:33 otis Exp $
 */
public class HitList extends AbstractList
{
private Hits _hits;

/**
 * Creates a new <code>HitList</code> instance.
 *
 * @param hits <code>Hits</code> to wrap
 */
public HitList(Hits hits)
{
_hits = hits;
}

/**
 * @see java.util.List#get(int)
 */
public Object get(int index)
{
try {
return _hits.doc(index);
} catch (IOException e) {
throw new RuntimeException(e);
}
}

/**
 * @see java.util.List#size()
 */
public int size() {
return _hits.length();
}


...
...

Otis


--- Luke Francl [EMAIL PROTECTED] wrote:

 On Wed, 2004-12-01 at 10:27, Otis Gospodnetic wrote:
 
  This is very similar to what I do - I create a List of Maps from
 Hits
  and its Documents.  So I think this change may be handy, if doable
 (I
  didn't look into changing the two Lucene classes, actually).
 
 
 How do you avoid the problem Eric just mentioned, iterating through
 all
 the Hits at once to populate this data structure?
 
 I do a similar thing, creating a List of asset references from a
 field
 in each Lucene Document in my Hits list (actual data for display
 retrieved from a separate datastore). I was not aware of any
 performance
 problems from doing this, but now I am wondering about the
 implications.
 
 Thanks,
 Luke
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: IndexWriter.optimize and memory usage

2004-12-02 Thread Otis Gospodnetic
Hello and quick answers:

See IndexWriter javadoc and in particular mergeFactor, minMergeDocs,
and maxMergeDocs.  This will let you control the size of your segments,
the frequency of segment merges, the amount of buffered Documents in
RAM between segment merges and such.  Also, you ask about calling
optimize periodically - no need, Lucene should already merge segments
once in a while for you.  Optimize at the end.  You can also experiment
with different JVM args for various GC algorithms.

Otis

--- Chris Hostetter [EMAIL PROTECTED] wrote:

 
 I've been running into an interesting situation that I wanted to ask
 about.
 
 I've been doing some testing by building up indexes with code that
 looks
 like this...
 
  IndexWriter writer = null;
  try {
  writer = new IndexWriter(index, new StandardAnalyzer(),
 true);
  writer.mergeFactor = MERGE_FACTOR;
  PooledExecutor queue = new
 PooledExecutor(NUM_UPDATE_THREADS);
  queue.waitWhenBlocked();
 
  for (int min=low; min < high; min += BATCH_SIZE) {
  int max = min + BATCH_SIZE;
  if (high < max) {
  max = high;
  }
  queue.execute(new BatchIndexer(writer, min, max));
  }
  end = new Date();
 System.out.println("Build Time: " + (end.getTime() -
 start.getTime()) + "ms");
  start = end;
  writer.optimize();
  } finally {
  if (null != writer) {
  try { writer.close(); } catch (Exception ignore)
 {/*NOOP*/; }
  }
  }
  end = new Date();
 System.out.println("Optimize Time: " + (end.getTime() -
 start.getTime()) + "ms");
 
 
 (where BatchIndexer is a class i have that gets a DB connection, and
 slurps all records from my DB between min and max and builds some
 simple
 Documents out of them and calls writer.addDocument(doc) on each)
 
 This was working fine with small ranges, but then i tried building up
 a
 nice big index for doing some performance testing.  i left it running
 overnight and when i came back in the morning i discovered that after
 successfully building up the whole index (~112K docs, ~1.5GB disk) it
 crashed with an OutOfMemory exception while trying to optimize.
 
 I then realized i was only running my JVM with a 256m upper limit on
 RAM,
 and i figured that PooledExecutor was still in scope, and maybe it
 was
 maintaining some state that was using up a lot of space, so i whipped
 up a
 quick little app to solve my problem...
 
 public static void main(String[] args) throws Exception {
 IndexWriter writer = null;
 try {
 writer = new IndexWriter(index, new StandardAnalyzer(),
 false);
 writer.optimize();
 } finally {
 if (null != writer) {
 try { writer.close(); } catch (Exception ignore) {
 /*NOOP*/; }
 }
 }
 }
 
 ...but I was disappointed to discover that even this couldn't run
 with
 only 256m of ram.  I bumped it up to 512m and then it managed to
 complete
 successfully (the final index was only 1.1GB of disk).
 
 
 This raises a few questions in my mind:
 
 1) Is there a rule of thumb for knowing how much memory it takes to
optimize an index?
 
 2) Is there a Best Practice to follow when building up a large
 index
from scratch in order to reduce the amount of memory needed to
 optimize
   once the whole index is built?  (ie: would spinning up a thread
 that
called writer.optimize() every N minutes be a good idea?)
 
 3) Given an unoptimized index that's already been built (ie: in the
 case
    where my builder crashed and i wanted to try and optimize it
 without
    having to rebuild from scratch) is there any way to get IndexWriter
 to
    use less RAM and more disk (trading speed for a smaller form
 factor --
and aparently: greater stability so that the app doesn't crash)
 
 
 I imagine that the answers to #1 and #2 are largely dependent on the
 nature of the data in the index (ie: the frequency of terms) but i'm
 wondering if there is a high level formula that could be used to say
 based on the nature of your data, you want to take this approach to
 optimizing when you build


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Document-Map, Hits-List

2004-12-01 Thread Otis Gospodnetic
This is very similar to what I do - I create a List of Maps from Hits
and its Documents.  So I think this change may be handy, if doable (I
didn't look into changing the two Lucene classes, actually).

Otis

--- petite_abeille [EMAIL PROTECTED] wrote:

 
 On Dec 01, 2004, at 13:37, Karthik N S wrote:
 
We create a ArrayList Object and Load all the Hit Values into
 them 
  and
  return
the same for Display purpose on a Servlet.
 
 Talking of which...
 
 It would be very handy if org.apache.lucene.search.Hits would
 implement 
 the java.util.List interface... in addition, 
 org.apache.lucene.document.Document could implement java.util.Map...
 
 That way, the rest of the application could pretend to simply have to
 
 deal with a List of Maps, without having to get exposed to any Lucene
 
 internals...
 
 Thought?
 
 Cheers,
 
 PA.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: What is the best file system for Lucene?

2004-11-30 Thread Otis Gospodnetic
Hello,

 Lucene indexing completes in 13-15 hours on the desktop system while
 it completes in about 29-33
 hours on the notebook.
 
 Now, combine it with the DROP INDEX tests completing in the same
 amount of time on both and find
 out why is the search only slightly faster :)
 
  Until then, all your measurements are subjective and you
  don't gain much by comparing the two indexing processes.
 
 I'm worried about searching. Indexing is a lot faster on the desktop
 config.

This tells you that your problem is not the disk itself, and not the
filesystem.  The bottleneck is elsewhere.

Why not run your search under a profiler?  That will tell you where the
JVM is spending its time.  It may even be in some weird InetAddress
call, like another person already pointed out.

Otis


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: similarity matrix - more clear

2004-11-30 Thread Otis Gospodnetic
Hello,

I don't think Lucene can spit out the similarity matrix for you, but
perhaps you can use Lucene's Term Vector support to help you build the
matrix yourself:
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/TermFreqVector.html

The other relevant sections of the Lucene API to look at are:
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexReader.html#getTermFreqVectors(int)
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/document/Field.html#Text(java.lang.String,%20java.io.Reader,%20boolean)
...

This should let you tell Lucene to compute and store term vectors
during indexing, and then you will be able to retrieve a Term Vector
for each Document in the index/collection.  Armed with this data you
should be able to compute similarities between Documents with TV dot
products/cosines, which should be enough for you to build your
similarity matrix.
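
A rough sketch of both halves, assuming the text lives in a "contents"
field and that plain term frequencies (no idf weighting) are good enough
for a first cut:

// at indexing time: the third argument asks Lucene to store a term vector
doc.add(Field.Text("contents", text, true));

// later: cosine between the stored term vectors of two documents
public static double cosine(IndexReader reader, int docA, int docB)
    throws IOException {
  TermFreqVector a = reader.getTermFreqVector(docA, "contents");
  TermFreqVector b = reader.getTermFreqVector(docB, "contents");
  if (a == null || b == null) return 0.0;
  String[] ta = a.getTerms();  int[] fa = a.getTermFrequencies();
  String[] tb = b.getTerms();  int[] fb = b.getTermFrequencies();
  double dot = 0, na = 0, nb = 0;
  int i = 0, j = 0;
  while (i < ta.length && j < tb.length) {   // both term arrays come back sorted
    int cmp = ta[i].compareTo(tb[j]);
    if (cmp == 0)      { dot += (double) fa[i] * fb[j]; i++; j++; }
    else if (cmp < 0)  { i++; }
    else               { j++; }
  }
  for (i = 0; i < fa.length; i++) na += (double) fa[i] * fa[i];
  for (j = 0; j < fb.length; j++) nb += (double) fb[j] * fb[j];
  return (na == 0 || nb == 0) ? 0.0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
}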

This sounds like something that would be nice to have in the Lucene
Sandbox, so if you end up with some code that you are allowed to share,
please contribute it back to Lucene.

Otis

--- Roxana Angheluta [EMAIL PROTECTED] wrote:

 Dear all,
 
 Yesterday I asked a question about getting the similarity matrix of
 a 
 collection of documents from an index, but I got only one answer, so 
 perhaps my question was not very clear.
 
 I will try to reformulate:
 
 I want to use Lucene to have efficient access to an index of a 
 collection of documents. My final purpose is to cluster documents. 
 Therefore I need to have for each pair of documents a number
 signifying 
 the similarity between them.
 A possible solution would be to initialize in turn each document as a
 
 query, do a search using an IndexSearcher and to take from the search
 
 result the similarity between the query (which is in fact a document)
 
 and all the other documents. This is highly redundant, because the 
 similarity between a pair of documents is computed multiple times.
 
 I was wondering whether there is a simpler way to do it, since the
 index 
 file contains all the information needed. Can anyone help me here?
 
 Thanks,
 roxana
 
 PS I know about the project Carrot2, which deals with document 
 clustering, but I think it is not appropriate for me because of 2
 reasons:
 1) I need to keep the index on the disk for further reusage
 2) I need to be able to search efficiently in the index
 I thought Lucene can help me here, am I wrong?
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Does QueryParser uses Analyzer ?

2004-11-30 Thread Otis Gospodnetic
QueryParser does use Analyzer, see this:

  static public Query parse(String query, String field, Analyzer
analyzer)
   throws ParseException {
QueryParser parser = new QueryParser(field, analyzer); 
return parser.parse(query);
  }

Otis
P.S.
Use lucene-user list, please.


--- Ricardo Lopes [EMAIL PROTECTED] wrote:

 Does the QueryParser class really use the Analyzer passed to the
 parse 
 method ?
 
 I looked at the code and I don't see the object being used anywhere in the 
 class. The problem is that I am writing an application with lucene
 that 
 searches using a foreign language with latin characters, the indexing
 
 works fine, but the search aparently doesn't call the Analyzer.
 
 Here is an example:
 i have a file that contains the following word: memória
 if i search for: memoria (without the puntuation charecter in the o)
 it 
 finds the word, which is correct
 if i search for: memória (the exact same word) it doesn't find the
 word, 
 because the QueryParser splits the word to mem ria, but if the 
 analyzer were called the ó would be replaced to o. I guess the 
 analyzer isn't called, is this right?
 
 Thanks in advance,
 Ricardo Lopes
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


