Re: Multiple indexes

2005-03-01 Thread Otis Gospodnetic
Ben, You do need to use a separate instance of those 3 classes for each index, yes. But this is really something like: IndexWriter writer = new IndexWriter(); So it's the normal code-writing process; you don't really have to create anything new, just use the existing Lucene API. As for locking,

Re: Ranking Terms

2005-02-26 Thread Otis Gospodnetic
Make sure you are not indexing your documents using the compound index format (default in the newer versions of Lucene). Then you will see the .frq file. Here is an example from one of Simpy's Lucene indices: -rw-r--r-- 1 simpy simpy 629073 Feb 26 13:14 _1ao.frq Otis --

Re: Lucene vs. in-DB-full-text-searching

2005-02-18 Thread Otis Gospodnetic
The most obvious answer is that the full-text indexing features of RDBMS's are not as good (as fast) as Lucene. MySQL, PostgreSQL, Oracle, MS SQL Server etc. all have full-text indexing/searching features, but I always hear people complaining about the speed. A person from a well-known online

Re: Search Performance

2005-02-18 Thread Otis Gospodnetic
Or you could just open a new IndexSearcher, forget the old one, and have GC collect it when everyone is done with it. Otis --- Chris Lamprecht [EMAIL PROTECTED] wrote: I should have mentioned, the reason for not doing this the obvious, simple way (just close the Searcher and reopen it if a
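The "open a new one and forget the old one" idea can be sketched without any Lucene types. In the sketch below, `SearcherHolder` and its type parameter `S` are hypothetical stand-ins for whatever searcher object the application uses; nothing here is Lucene API, and it deliberately does not close the old instance.

```java
import java.util.concurrent.atomic.AtomicReference;

/**
 * Hypothetical holder for the "current" searcher. S stands in for the
 * application's searcher type; this is an illustration, not Lucene API.
 */
final class SearcherHolder<S> {
    private final AtomicReference<S> current;

    SearcherHolder(S initial) {
        current = new AtomicReference<>(initial);
    }

    /** Callers take a reference once per search and keep using that one. */
    S get() {
        return current.get();
    }

    /**
     * Publishes a freshly opened searcher and returns the previous one.
     * The previous instance is not closed here: searches already in flight
     * may still hold it, and it becomes garbage once they are done.
     */
    S swap(S fresh) {
        return current.getAndSet(fresh);
    }
}
```

A search that obtained its reference before the swap keeps working against the old instance, which is the point of the approach. Whether leaving the old instance unclosed is acceptable depends on the application's resource limits.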

Re: Document comparison

2005-02-18 Thread Otis Gospodnetic
Matt, Erik and I have some code for this in Lucene in Action, but David Spencer did this since the book was published: http://www.lucenebook.com/blog/announcements/more_like_this.html Otis --- Matt Chaput [EMAIL PROTECTED] wrote: Is there a simple, efficient way to compute similarity of

Re: Search Performance

2005-02-18 Thread Otis Gospodnetic
this leave open file handles? I had a problem where there were lots of open file handles for deleted index files, because the old searchers were not being closed. On Fri, 18 Feb 2005 13:41:37 -0800 (PST), Otis Gospodnetic [EMAIL PROTECTED] wrote: Or you could just open a new IndexSearcher

Re: Concurrent searching re-indexing

2005-02-16 Thread Otis Gospodnetic
Hi Paul, If I understand your setup correctly, it looks like you are running multiple threads that create IndexWriter for the same directory. That's a no-no. This section (first hit) describes the various concurrency issues with regard to adds, updates, optimization, and searches:

Re: What does [] do to a query and what's up with lucene.apache.org?

2005-02-14 Thread Otis Gospodnetic
Hi, lucene.apache.org seems to work now. Here is the query syntax: http://lucene.apache.org/queryparsersyntax.html [] is used as [BEGIN-RANGE-STRING TO END-RANGE-STRING] Otis --- Jim Lynch [EMAIL PROTECTED] wrote: First I'm getting a The requested URL could not be retrieved
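A small sketch of the bracket syntax, assuming the range is inclusive at both ends and that terms compare as plain strings. The class `RangeSyntax` and the field name `mod_date` are made up for illustration, and `inRange` only mimics string ordering; it is not how the query parser is implemented.

```java
final class RangeSyntax {
    /** Builds query text of the form field:[begin TO end]. */
    static String inclusiveRange(String field, String begin, String end) {
        return field + ":[" + begin + " TO " + end + "]";
    }

    /** Plain string comparison, inclusive at both ends. */
    static boolean inRange(String term, String begin, String end) {
        return term.compareTo(begin) >= 0 && term.compareTo(end) <= 0;
    }
}
```

Under string ordering, "100" falls inside [10 TO 20] while "9" does not, which is why values meant for range searches are usually indexed in a fixed-width, sortable form such as yyyyMMdd.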

Re: behavioral differences between Field.Keyword and Field.UnStored

2005-02-11 Thread Otis Gospodnetic
The QueryParser is analyzing your Field.Keyword (genre field) fields, because it doesn't know that genre is a Keyword field and should not be analyzed. Check section 4.4. here: http://www.lucenebook.com/search?query=queryparser+keyword Otis --- Mike Rose [EMAIL PROTECTED] wrote: Perhaps

Re: Optimize not deleting all files

2005-02-04 Thread Otis Gospodnetic
Get and try Lucene 1.4.3. One of the older versions had a bug that was not deleting old index files. Otis --- [EMAIL PROTECTED] wrote: Hi, When I run an optimize in our production environment, old index are left in the directory and are not deleted. My understanding is that an

Re: Numbers in the Query String

2005-02-03 Thread Otis Gospodnetic
Using different analyzers for indexing and searching is not recommended. Your numbers are not even in the index because you are using StandardAnalyzer. Use Luke to look at your index. Otis --- Hetan Shah [EMAIL PROTECTED] wrote: Hello, How can one search for a document based on the query

Re: which HTML parser is better?

2005-02-02 Thread Otis Gospodnetic
If you are not married to Java: http://search.cpan.org/~kilinrax/HTML-Strip-1.04/Strip.pm Otis --- sergiu gordea [EMAIL PROTECTED] wrote: Karl Koch wrote: I am in control of the html, which means it is well formated HTML. I use only HTML files which I have transformed from XML. No

RE: carrot2 question too - Re: Fun with the Wikipedia

2005-01-31 Thread Otis Gospodnetic
Adam, Dawid posted some code that lets you use Carrot2 locally with Lucene, without the componentized pipe line system described on Carrot2 site. Otis --- Adam Saltiel [EMAIL PROTECTED] wrote: David, Hi, Would you be able to comment on coincidentally recent thread RE: - Grouping Search

Re: total number of (unique) terms in the index

2005-01-28 Thread Otis Gospodnetic
I don't think there is a direct way to get the number of (unique) terms in the index, so yes, I think you'll have to loop through TermEnum and count. Otis --- Jonathan Lasko [EMAIL PROTECTED] wrote: I'm looking for the total number of unique terms in the index. I see that I can get a
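The "loop and count" approach looks roughly like the sketch below. `TermCursor` is a hypothetical stand-in for a term enumeration that starts before the first term and is advanced with `next()`; the `over` adapter exists only so the sketch can be run against an in-memory list.

```java
import java.util.Iterator;
import java.util.List;

final class TermCounting {
    /** Hypothetical stand-in: advance, and report whether a term is available. */
    interface TermCursor {
        boolean next();
    }

    /** Walks the cursor to the end and returns how many terms it produced. */
    static long countTerms(TermCursor cursor) {
        long count = 0;
        while (cursor.next()) {
            count++;
        }
        return count;
    }

    /** Adapter over a list of already-unique terms, for trying the loop out. */
    static TermCursor over(List<String> uniqueTerms) {
        Iterator<String> it = uniqueTerms.iterator();
        return () -> {
            if (!it.hasNext()) {
                return false;
            }
            it.next();
            return true;
        };
    }
}
```

The cost is one pass over every term, so on a large index this is something to compute occasionally rather than per request.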

Re: Loading a large index

2005-01-28 Thread Otis Gospodnetic
Edwin, --- Edwin Tang [EMAIL PROTECTED] wrote: I have three indices really that I search via ParallelMultiSearcher. All three are being updated constantly. We would like to be able to perform a search on the indices and have the results reflect the latest documents indexed. However, that

Re: Disk space used by optimize

2005-01-28 Thread Otis Gospodnetic
Morus, that description of 3 sets of index files is what I was imagining, too. I'll have to test and add to the book errata, it seems. Thanks for the info, Otis --- Morus Walter [EMAIL PROTECTED] wrote: Otis Gospodnetic writes: Hello, Yes, that is how optimize works - copies all

Re: Lucene in Action hits desk in UK

2005-01-28 Thread Otis Gospodnetic
: Is Lucene-in-Action being sold anywhere in Singapore? thanks! Otis Gospodnetic [EMAIL PROTECTED] wrote: Gospodnetić sounds like Gospodnetich and Eric is Erik :) Otis --- John Haxby wrote: Otis Gospodnetic wrote: I contacted both the US and UK Amazon sites and asked them

Re: Different Documents (with fields) in one index?

2005-01-27 Thread Otis Gospodnetic
Karl, This is completely fine. You can have documents with different fields in the same index. Otis --- Karl Koch [EMAIL PROTECTED] wrote: Hello all, perhaps not such a sophisticated question: I would like to have a very diverse set of documents in one index. Depending on the inside

Re: Boosting Questions

2005-01-27 Thread Otis Gospodnetic
Luke, Boosting is only one of the factors involved in Document/Query scoring. Assuming that applying your boosts to Document A or to a single field of Document A increases the total score enough, yes, Document A may have the highest score. But just because you boost a single Document and

Re: XML index

2005-01-27 Thread Otis Gospodnetic
Hello Karl, Grab the source code for Lucene in Action, it's got code that parses and indexes XML with DOM and SAX. You can see the coverage of that stuff here: http://lucenebook.com/search?query=indexing+XML+section%3A7* I haven't used kXML, but I imagine the LIA code should get you going

Re: Disk space used by optimize

2005-01-27 Thread Otis Gospodnetic
Hello, Yes, that is how optimize works - it copies all existing index segments into one unified index segment, thus optimizing it. See hit #1: http://www.lucenebook.com/search?query=optimize+disk+space However, three times the space sounds a bit too much, or I made a mistake in the book. :) You

Re: rackmount lucene/nutch - Re: google mini? who needs it when Lucene is there

2005-01-27 Thread Otis Gospodnetic
I discuss this with myself a lot inside my head... :) Seriously, I agree with Erik. I think this is a business opportunity. How many people are hating me now and going shh? Raise your hands! Otis --- David Spencer [EMAIL PROTECTED] wrote: This reminds me, has anyone ever discussed

RE: Disk space used by optimize

2005-01-27 Thread Otis Gospodnetic
files are: the .cfs (46.8MB), deletable (4 bytes), and segments (29 bytes). --Leto -Original Message- From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] Hello, Yes, that is how optimize works - copies all existing index segments into one unified index segment, thus

Re: google mini? who needs it when Lucene is there

2005-01-27 Thread Otis Gospodnetic
500 times the original data? Not true! :) Otis --- Xiaohong Yang (Sharon) [EMAIL PROTECTED] wrote: Hi, I agree that Google mini is quite expensive. It might be similar to the desktop version in quality. Anyone knows google's ratio of index to text? Is it true that Lucene's index is

Re: Lucene in Action hits desk in UK

2005-01-26 Thread Otis Gospodnetic
Publisher - Amazon information feed seems to be a fairly manual process, and Amazon takes a while to update book information on their site, including prices. I contacted both the US and UK Amazon sites and asked them to fix my last name (the last character in my name has a little slash (not an

Re: Getting Into Search

2005-01-26 Thread Otis Gospodnetic
Hi Luke, That's not hard with RangeQuery (supported by QueryParser), take a look at this: http://www.lucenebook.com/search?query=date+range The grayed-out text has the section name and page number, so you can quickly locate this stuff in your ebook. Otis P.S. Do you know if Indigo/Chapters

Re: Lucene in Action hits desk in UK

2005-01-26 Thread Otis Gospodnetic
Gospodnetić sounds like Gospodnetich and Eric is Erik :) Otis --- John Haxby [EMAIL PROTECTED] wrote: Otis Gospodnetic wrote: I contacted both the US and UK Amazon sites and asked them to fix my last name (the last character in my name has a little slash (not an accent) above

Re: Search on heterogenous index

2005-01-25 Thread Otis Gospodnetic
Hello Simeon, Heterogenous Documents/indices are OK - check out the second hit: http://www.lucenebook.com/search?query=heterogenous+different Otis --- Simeon Koptelov [EMAIL PROTECTED] wrote: Hello all. I'm new to lucene and think about using it in my project. I have prices with dynamic

Re: Search Chinese in Unicode !!!

2005-01-25 Thread Otis Gospodnetic
I don't have a document with chinese characters to verify this, but it looks right, so I'll add your change to SearchFiles.java. Thanks, Otis --- Eric Chow [EMAIL PROTECTED] wrote: Search not really correct with UTF-8 !!! The following is the search result that I used the SearchFiles in

Re: English and French documents together / analysis, indexing, searching

2005-01-23 Thread Otis Gospodnetic
That would be a partial solution. Accents will not be a problem any more, but if you use an Analyzer that stems tokens, they will not really be tokenized properly. Searches will probably work, but if you look at the index you will see that some terms were not analyzed properly. But it may be

Re: keep indexes as files or save them in database

2005-01-23 Thread Otis Gospodnetic
A number of people have tried putting Lucene indices in RDBMS. As far as I know, all were slower than FSDirectory. Otis --- nafise hassani [EMAIL PROTECTED] wrote: Hi I want to know from the performance point of view it is better to save lucene indexes in database or use them as files???

Re: Opening up one large index takes 940M or memory?

2005-01-22 Thread Otis Gospodnetic
It would be interesting to know _what_exactly_ uses your memory. Running under an optimizer should tell you that. The only thing that comes to mind is... can't remember the details now, but when the index is opened, I believe every 128th term is read into memory. This, I believe, helps with

Re: Opening up one large index takes 940M or memory?

2005-01-22 Thread Otis Gospodnetic
There Kevin, that's what I was referring to, the .tii file. Otis --- Paul Elschot [EMAIL PROTECTED] wrote: On Saturday 22 January 2005 01:39, Kevin A. Burton wrote: Kevin A. Burton wrote: We have one large index right now... its about 60G ... When I open it the Java VM used 940M

Re: Lucene in Action

2005-01-22 Thread Otis Gospodnetic
Hi Ansi, If you want the print version, I would guess you could order it from the publisher (http://www.manning.com/hatcher2) or from Amazon and they will ship it to you in China. The electronic version (a PDF file) is also available from the above URL. I'll ask Manning Publications and see

Re: Opening up one large index takes 940M or memory?

2005-01-22 Thread Otis Gospodnetic
Yes, I remember your email about the large number of Terms. If it can be avoided and you figure out how to do it, I'd love to patch something. :) Otis --- Kevin A. Burton [EMAIL PROTECTED] wrote: Otis Gospodnetic wrote: It would be interesting to know _what_exactly_ uses your memory

Re: Stemming

2005-01-21 Thread Otis Gospodnetic
Hi Kevin, Stemming is an optional operation and is done in the analysis step. Lucene comes with a Porter stemmer and a Filter that you can use in an Analyzer: ./src/java/org/apache/lucene/analysis/PorterStemFilter.java ./src/java/org/apache/lucene/analysis/PorterStemmer.java You can find more

RE: Filtering w/ Multiple Terms

2005-01-21 Thread Otis Gospodnetic
This: http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/BooleanQuery.TooManyClauses.html ? You can control that limit via http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/BooleanQuery.html#maxClauseCount Otis --- Jerry Jalenak [EMAIL PROTECTED] wrote: OK.

Re: Suggestion needed for extranet search

2005-01-21 Thread Otis Gospodnetic
Hi Ranjan, It sounds like you should look at and use Nutch: http://www.nutch.org Otis --- Ranjan K. Baisak [EMAIL PROTECTED] wrote: I am planning to move to Lucene but not have much knowledge on the same. The search engine which I had developed is searching some extranet URLs e.g.

Re: Concurrent read and write

2005-01-21 Thread Otis Gospodnetic
Hello Ashley, You can read/search while modifying the index, but you have to ensure only one thread or only one process is modifying an index at any given time. Both IndexReader and IndexWriter can be used to modify an index. The former to delete Documents and the latter to add them. You have
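One way to enforce "only one thread modifies at a time" inside a single JVM is to route every modification through one lock. The class `IndexModificationGate` below is a hypothetical sketch, not part of Lucene, and it does nothing for the multi-process case.

```java
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

/** Hypothetical gate that serializes index modifications within one JVM. */
final class IndexModificationGate {
    private final ReentrantLock lock = new ReentrantLock();

    /** Runs the change while holding the lock and returns its result. */
    <T> T modify(Supplier<T> change) {
        lock.lock();
        try {
            return change.get();
        } finally {
            lock.unlock(); // released even if the change throws
        }
    }

    /** True while some thread is inside modify(). */
    boolean isModifying() {
        return lock.isLocked();
    }
}
```

Adds and deletes would both go through `modify`, while searches would not take the lock at all.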

RE: Search Chinese in Unicode !!!

2005-01-21 Thread Otis Gospodnetic
If you are hosting the code somewhere (e.g. your site, SF, java.net, etc.), we should link to them from one of the Lucene pages where we link to related external tools, apps, and such. Otis --- Safarnejad, Ali (AFIS) [EMAIL PROTECTED] wrote: I've written a Chinese Analyzer for Lucene that

RE: help in indexing

2005-01-20 Thread Otis Gospodnetic
Hello Chetan, The code that comes with the Lucene book contains a little framework for indexing rich-text documents. It sounds like you may be able to use it as-is, extending it with a parser for Excel files, which we didn't include in the code (should we include it in the next edition?).

Re: lucene2.0 and transaction support

2005-01-20 Thread Otis Gospodnetic
The Wiki has some info about Lucene 2.0, but that is all there is about 2.0. Regarding transactions - have you tried DbDirectory? I believe that will provide XA support and it won't require Lucene changes. Otis --- John Wang [EMAIL PROTECTED] wrote: Hi: When is lucene 2.0 scheduled to

Re: Closed IndexWriter reuse

2005-01-20 Thread Otis Gospodnetic
No, you can't add documents to an index once you close the IndexWriter. You can re-open the IndexWriter and add more documents, of course. Otis --- Oscar Picasso [EMAIL PROTECTED] wrote: Hi, Is it safe to add documents to an IndexWriter that has been closed? From what I have seen, the

Re: Why IndexReader.lastModified(index) is depricated?

2005-01-19 Thread Otis Gospodnetic
Going for the segments file like that is not a recommended practise, or at least not something I'd recommend. The 'segments' file is really something that a caller should not know anything about. One day Lucene may choose to rename the segments file or some such, and the code that uses this trick

Re: How do I unlock?

2005-01-11 Thread Otis Gospodnetic
I didn't pay full attention to this thread, but it sounds like somebody may be interested in RuntimeShutdownHook (or some similar name) as a place to try to release the locks. Otis --- Joseph Ottinger [EMAIL PROTECTED] wrote: On Tue, 11 Jan 2005, Doug Cutting wrote: Joseph Ottinger wrote:

Re: How do I unlock?

2005-01-11 Thread Otis Gospodnetic
Eh, that exactly :) When I read my emails in reverse order --- Chris Lamprecht [EMAIL PROTECTED] wrote: What about a shutdown hook? Runtime.getRuntime().addShutdownHook(new Thread() { public void run() { /* whatever */ } }); see also
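The shutdown-hook idea quoted above could be fleshed out along these lines. `LockCleanup` and the notion of a single lock file passed in by the caller are assumptions made for the sketch; whether deleting a lock file by hand is safe depends entirely on the application.

```java
import java.io.File;

final class LockCleanup {
    /** Builds the hook thread; kept separate so it can be exercised directly. */
    static Thread cleanupHook(final File lockFile) {
        return new Thread() {
            @Override
            public void run() {
                // delete() simply returns false if the file is already gone.
                lockFile.delete();
            }
        };
    }

    /** Registers the hook so it runs when the JVM shuts down normally. */
    static void register(File lockFile) {
        Runtime.getRuntime().addShutdownHook(cleanupHook(lockFile));
    }
}
```

Shutdown hooks run on normal exit and on most termination signals, but not when the process is killed outright, so a stale lock can still be left behind.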

Re: Performance question

2005-01-10 Thread Otis Gospodnetic
Use one index, working with a single index is simpler. Also, once you pull a Document from Hits object, all Fields are read off of the disk. There was some discussion about selective Field reading about a week ago, check the list archives. Also keep in mind Field compression is now possible

Re: Duplicate Id

2005-01-07 Thread Otis Gospodnetic
Hello, If you search for India OR Test, you will find both, if you use AND, you will find none. Lucene can search any text, not just files. It sounds like you are using Lucene's demo as a real application (not a good practise). I suggest you take a look at the Resources page on the Lucene Wiki

Re: reading fields selectively

2005-01-06 Thread Otis Gospodnetic
Hi John, There is no API for this, but I recall somebody talking about adding support for this a few months back. I even think that somebody might have contributed a patch for this. I am not certain about this, but check the patch queue (link on Lucene site). If there is a patch there, even if

Re: Lucene Book in UK

2005-01-06 Thread Otis Gospodnetic
The book is $44.95 USD - it's printed on the back cover. Amazon had the correct price (minus their discount) until recently. They are just very slow with their site/book info updates, but I'm sure they'll fix it eventually. Otis --- Erik Hatcher [EMAIL PROTECTED] wrote: On Jan 6, 2005, at

Re: RemoteSearcher

2005-01-06 Thread Otis Gospodnetic
Nutch (nutch.org) has a pretty sophisticated infrastructure for distributed searching, but it doesn't use RemoteSearcher. Otis --- Yura Smolsky [EMAIL PROTECTED] wrote: Hello. Does anyone know application which based on RemoteSearcher to distribute index on many servers? Yura Smolsky,

Re: Parsing issue

2005-01-04 Thread Otis Gospodnetic
That's the correct place to look and it includes code samples. Yes, it's a Jar file that you add to the CLASSPATH and use ... hm, normally programmatically, yes :). Otis --- Hetan Shah [EMAIL PROTECTED] wrote: Has any one used NekoHTML ? If so how do I use it. Is it a stand alone jar

Re: Help for sorting

2005-01-03 Thread Otis Gospodnetic
Hello, --- mahaveer jain [EMAIL PROTECTED] wrote: I am looking out to implement sorting in my lucene application. This is what my code look like. I am using StandardAnalyzer() analyzer. Query query = QueryParser.parse(keyword, contents, analyzer); Sort sortCol = new Sort(new

Re: how often to optimize?

2004-12-28 Thread Otis Gospodnetic
Correct. The self-maintenance you are referring to is Lucene's periodic segment merging. The frequency of that can be controlled through IndexWriter's mergeFactor. Otis --- aurora [EMAIL PROTECTED] wrote: Are not optimized indices causing you any problems (e.g. slow searches, high number

Re: Need an analyzer that includes numbers.

2004-12-25 Thread Otis Gospodnetic
WhitespaceAnalyzer will let you have it. It just breaks the input on spaces. Otis --- Jim [EMAIL PROTECTED] wrote: I've seen some discussion on this and the answer seems to be write your own. Hasn't someone already done that by now that would share? I really have to be able to include
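To see what "just breaks the input on spaces" means for numbers, here is a plain-Java approximation. `WhitespaceSplit` is an illustration only, not Lucene's implementation, and it uses the regex class `\s` as its definition of whitespace.

```java
import java.util.ArrayList;
import java.util.List;

final class WhitespaceSplit {
    /** Splits on runs of whitespace and changes nothing else. */
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(token);
            }
        }
        return out;
    }
}
```

Splitting "Room 8, floor 11" this way keeps both numbers, but it also keeps the comma attached to "8," and leaves case untouched, so queries have to match those tokens exactly.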

Re: nable to read TLD META-INF/c.tld from JAR file ... standard.jar

2004-12-23 Thread Otis Gospodnetic
Most definitely Jetty. I can't believe you're using Tomcat for Rojo! ;) Otis --- Erik Hatcher [EMAIL PROTECTED] wrote: Wrong list. Though perhaps you should be using Jetty ;) Erik On Dec 23, 2004, at 4:17 PM, Kevin A. Burton wrote: What in the world is up with this

Re: retrieve tokens

2004-12-22 Thread Otis Gospodnetic
Martijn, have you seen the Highlighter in the Lucene Sandbox? If you've stored your text in the Lucene index, there is no need to go back to DB to pull out the blog, parse it, and highlight it - the Highlighter in the Sandbox will do this for you. Otis --- M. Smit [EMAIL PROTECTED] wrote:

Re: (Offtopic) The unicode name for a character

2004-12-22 Thread Otis Gospodnetic
If you are not tied to Java, see 'unac' at http://www.senga.org/. It's old, but if nothing else you could see how it works and rewrite it in Java. And if you can, you can donate it to Lucene Sandbox. Otis --- Peter Pimley [EMAIL PROTECTED] wrote: Hi everyone, The Question: In Java

Re: retrieve tokens

2004-12-22 Thread Otis Gospodnetic
I suspect Martijn really wants that snippet dynamically generated, with KWIC, as on the lucenebook.com screen shot. Thus, he can't generate and store the snippet at index time, and has to construct it at search time. Otis --- Mike Snare [EMAIL PROTECTED] wrote: But for the other issue on

Re: retrieve tokens

2004-12-22 Thread Otis Gospodnetic
For simpy.com I store the full text of web pages in Lucene, in order to provide full-text web searches. Nutch (nutch.org) does the same. You can set the maximal number of tokens you want indexed via IndexWriter. You can also compress fields in the newest version of Lucene (or maybe just the one

Re: addIndexes() Question

2004-12-22 Thread Otis Gospodnetic
I _think_ you'd be better off doing it all at once, but I wouldn't trust myself on this and would instead construct a small 3-index set and test, looking at a) maximal disk usage, b) time, and c) RAM usage. :) Otis --- Ryan Aslett [EMAIL PROTECTED] wrote: Hi there, Im about to embark on a

Re: index size doubled?

2004-12-21 Thread Otis Gospodnetic
Another possibility is that you are using an older version of Lucene, which was known to have a bug with similar symptoms. Get the latest version of Lucene. You shouldn't really have multiple .cfs files after optimizing your index. Also, optimize only at the end, if you care about indexing

Re: how often to optimize?

2004-12-21 Thread Otis Gospodnetic
Hello, I think some of these questions may be answered in the jGuru FAQ. So my question is would it be an overkill to optimize everyday? Only if lots of documents are being added/deleted, and you end up with a lot of index segments. Is there any guideline on how often to optimize?

Re: analyzer effecting phrases?

2004-12-20 Thread Otis Gospodnetic
When searching for phrases, what's important is the position of each token/word extracted by the Analyzer. WhitespaceAnalyzer/LowerCaseFilter don't do anything with the positional information. There is nothing else in your Analyzer? In any case, the following should help you see what your

RE: Queries difference

2004-12-20 Thread Otis Gospodnetic
Alex, I think you want this: +city:London +city:Amsterdam +address:1_street +address:2_street Otis --- Alex Kiselevski [EMAIL PROTECTED] wrote: Thanks Morus So if I understand right If the seqond query is : +city(London) +city(Amsterdam) +address(1_street) +address(2_street) Both

Re: Indexing with Lucene 1.4.3

2004-12-17 Thread Otis Gospodnetic
The only place where you have to specify that you are using the compound index format is on IndexWriter instance. Nothing needs to be done at search time on IndexSearcher. Otis --- Hetan Shah [EMAIL PROTECTED] wrote: Thanks Chuck, I now understand why I see only one file. Another question

Re: Disk space needed for indexing???

2004-12-16 Thread Otis Gospodnetic
The exact disk space usage depends on the number of fields in the index and on how many of them store the original text. You should also keep in mind that the call to IndexWriter's optimize() will result in your index directory size doubling while the optimization is in progress, so if you want

Re: Why does the StandardTokenizer split hyphenated words?

2004-12-16 Thread Otis Gospodnetic
Hello, As Erik already said - that Analyzer is really there to get people going quickly and as a 'does pretty good' Analyzer. There is no Analyzer that will work for everyone, and Analyzers are meant to be custom-made. It looks like you already got that figured out and have your own Analyzer.

Re: Indexing a large number of DB records

2004-12-15 Thread Otis Gospodnetic
--- Otis Gospodnetic [EMAIL PROTECTED] wrote: Hello, There are a few things you can do: 1) Don't just pull all rows from the DB at once. Do that in batches. 2) If you can get a Reader from your SqlDataReader, consider this: http://jakarta.apache.org/lucene/docs

RE: Indexing a large number of DB records

2004-12-15 Thread Otis Gospodnetic
Note that this really includes some extra steps. You don't need a temp index. Add everything to a single index using a single IndexWriter instance. No need to call addIndexes nor optimize until the end. Adding Documents to an index takes a constant amount of time, regardless of the index size,

Re: A question about scoring function in Lucene

2004-12-15 Thread Otis Gospodnetic
There is one case that I can think of where this 'constant' scoring would be useful, and I think Chuck already mentioned this 1-2 months ago. For instance, having such scores would allow one to create alert applications where queries run by some scheduler would trigger an alert whenever the score

Re: finalize delete without optimize

2004-12-14 Thread Otis Gospodnetic
On Mon, 13 Dec 2004 22:24:12 -0800 (PST), Otis Gospodnetic [EMAIL PROTECTED] wrote: Hello John, I believe you didn't get any replies to this. What you are describing cannot be done using the public, but maaay (no source code on this machine, so I can't double-check that) be doable

RE: TFIDF Implementation

2004-12-14 Thread Otis Gospodnetic
You can also see 'Books like this' example from here https://secure.manning.com/catalog/view.php?book=hatcher2item=source Otis --- Bruce Ritchie [EMAIL PROTECTED] wrote: Christoph, I'm not entirely certain if this is what you want, but a while back David Spencer did code up a 'More Like

RE: Opinions: Using Lucene as a thin database

2004-12-14 Thread Otis Gospodnetic
Well, one could always partition an index, distribute pieces of it horizontally across multiple 'search servers' and use the built-in RMI-based and Parallel search feature. Nutch uses something similar for search scaling. Otis --- Monsur Hossain [EMAIL PROTECTED] wrote: My concern is that

Re: finalize delete without optimize

2004-12-14 Thread Otis Gospodnetic
Hello John, I believe you didn't get any replies to this. What you are describing cannot be done using the public, but maaay (no source code on this machine, so I can't double-check that) be doable if you use some of the 'internal' methods. I don't have the need for this, but others might, so

RE: Opinions: Using Lucene as a thin database

2004-12-14 Thread Otis Gospodnetic
You can see Flickr-like tag (lookup) system at my Simpy site ( http://www.simpy.com ). It uses Lucene as the backend for lookups, but still uses a RDBMS as the primary storage. I find it that keeping the RDBMS and Lucene indices is a bit of a pain and error prone, so _thin_ storage layer with

Re: Indexing a large number of DB records

2004-12-14 Thread Otis Gospodnetic
Hello, There are a few things you can do: 1) Don't just pull all rows from the DB at once. Do that in batches. 2) If you can get a Reader from your SqlDataReader, consider this:
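Point 1, pulling rows in batches, might look like the following sketch. `BatchIndexing` and the `fetch` function (offset and limit in, one page of rows out) are hypothetical stand-ins for a paged database query; the handler is where each row would be turned into a document and added to the index.

```java
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Consumer;

final class BatchIndexing {
    /**
     * Pulls rows page by page instead of all at once.
     * fetch takes (offset, limit) and returns at most limit rows.
     * Returns the total number of rows handled.
     */
    static <R> int forEachBatch(BiFunction<Integer, Integer, List<R>> fetch,
                                int batchSize,
                                Consumer<R> handle) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int offset = 0;
        while (true) {
            List<R> rows = fetch.apply(offset, batchSize);
            for (R row : rows) {
                handle.accept(row);
            }
            offset += rows.size();
            if (rows.size() < batchSize) {
                return offset; // a short page means the source is exhausted
            }
        }
    }
}
```

Only one page of rows is held in memory at a time, which is the main benefit when the table is large.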

Re: Indexing HTML files give following message

2004-12-12 Thread Otis Gospodnetic
Hello, This is probably due to some bad HTML. The application you are using is just a demo, and uses a JavaCC-based HTML parser, which may not be resilient to invalid HTML. For Lucene in Action we developed a little extensible indexing framework, and for HTML indexing we used 2 tools to handle

Re: Finding unused segment files?

2004-12-12 Thread Otis Gospodnetic
polluted. * * TODO: this tool should really lock the directory for writing before * removing any Lucene segment files, otherwise this tool itself may * corrupt the index. * * @author Otis Gospodnetic * @version $Id$ */ public class SegmentPurger { // TODO: copied from SegmentMerger

RE: OutOfMemoryError with Lucene 1.4 final

2004-12-10 Thread Otis Gospodnetic
Ying, You should follow this finally block advice below. In addition, I think you can just close the reader, and it will close the underlying stream (I'm not sure about that, double-check it). You are not running out of file handles, though. Your JVM is running out of memory. You can play
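Assuming the reader in question is a `java.io.Reader` wrapped around a stream, the finally-block pattern can be sketched as below. `ReaderCleanup` is a made-up name; for `InputStreamReader`, closing the reader also closes the stream it wraps.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class ReaderCleanup {
    /** Reads the whole stream as UTF-8 text and always closes the reader. */
    static String readAll(InputStream in) {
        Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
        try {
            StringBuilder text = new StringBuilder();
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
            return text.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            try {
                reader.close(); // also closes the wrapped InputStream
            } catch (IOException ignored) {
                // nothing useful can be done if close itself fails
            }
        }
    }
}
```

This releases file handles promptly, but as the reply notes, it does not by itself address a JVM that is running out of heap.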

Re: maxDoc()

2004-12-09 Thread Otis Gospodnetic
Hello Garrett, Share some code, it will be easier for others to help you that way. Obviously, this would be a huge bug if the problem were within Lucene. Otis --- Garrett Heaver [EMAIL PROTECTED] wrote: Can anyone please explain to me why maxDoc returns 0 when Luke shows 239,473 documents?

Re: problem restoring index

2004-12-08 Thread Otis Gospodnetic
There is no need to reindex. However, I also don't quite get what the problem is :) Otis --- Santosh [EMAIL PROTECTED] wrote: hi, when I restart the tomcat . the Index is getting corrupted. If I take the backup of Index and then restarting tomcat. the Index is not working properly.

Re: searchig with special characters

2004-12-08 Thread Otis Gospodnetic
Leading wildcard character (*) is not allowed if you use QueryParser that comes with Lucene. Reason: performance. See many discussions about this on lucene-user mailing list. Also see the search syntax document on the Lucene site. What other characters are you having trouble with? Otis ---

Re: Empty/non-empty field indexing question

2004-12-08 Thread Otis Gospodnetic
? But then is there a point putting an empty value in it, if an application will never search for empty values? thanks -pedja Otis Gospodnetic said the following on 12/8/2004 1:31 AM: Empty fields won't add any value, you can skip them. Documents in an index don't have

Re: 'IN' type search

2004-12-08 Thread Otis Gospodnetic
Hello, You can use BooleanQuery for that. Otis --- Ravi [EMAIL PROTECTED] wrote: Hi How do you get all documents in lucene where a particular field value is in a given list of values (like SQL IN). What kind of Query class should I use? Thanks in advance. Ravi.

RE: When is the book released?

2004-12-07 Thread Otis Gospodnetic
Hello, Yes, Lucene in Action has been listed on Amazon for a while now (I think I recorded this in my blog some time back). The publish date is, I believe, the date provided by publishers, but things almost always take longer than predicted, so 31.12.2004 may be a bit off. :( However, the ebook

Re: QueryFilter vs CachingWrapperFilter vs RangeQuery

2004-12-07 Thread Otis Gospodnetic
If you run the same query again, the IndexSearcher will go all the way to the index again - no caching. Some caching will be done by your file system, possibly, but that's it. Lucene is fast, so don't optimize early. Otis --- Ben Rooney [EMAIL PROTECTED] wrote: thanks chris, you are

Re: Empty/non-empty field indexing question

2004-12-07 Thread Otis Gospodnetic
Empty fields won't add any value, you can skip them. Documents in an index don't have to be uniform. Each Document could have a different set of fields. Of course, that has some obvious implications for search, but is perfectly fine technically. Otis --- [EMAIL PROTECTED] [EMAIL PROTECTED]

Re: addIndexes() Size

2004-12-06 Thread Otis Gospodnetic
If I were you, I would first use Luke to peek at the index. You may find something obvious there, like multiple copies of the same Document. Does your temp index 'overlap' with A index in terms of Documents? If so, you will end up with multiple copies, as the addIndexes method doesn't detect and

Re: Index delete failing

2004-12-06 Thread Otis Gospodnetic
This smells like a Windows issue. It is possible that something in your JVM is still holding onto the index directory (for example, FSDirectory), and Winblows is not letting you remove the directory. I bet this will work if you exit the JVM and run java.io.File.delete() without calling Lucene.
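For the suggested experiment, note that `java.io.File.delete()` removes a directory only when it is empty, so an index directory has to be emptied first. A minimal sketch, with `DirectoryRemoval` as a made-up helper name:

```java
import java.io.File;

final class DirectoryRemoval {
    /** Returns true only if the path and everything beneath it was removed. */
    static boolean deleteRecursively(File path) {
        // listFiles() returns null for plain files and for missing paths.
        File[] children = path.listFiles();
        boolean childrenRemoved = true;
        if (children != null) {
            for (File child : children) {
                childrenRemoved &= deleteRecursively(child);
            }
        }
        return path.delete() && childrenRemoved;
    }
}
```

If this returns false while another part of the same JVM still has files in the directory open, that would be consistent with the file-holding explanation above.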

Re: Single Digit Indexing

2004-12-06 Thread Otis Gospodnetic
Hm, if you can index 11, you should be able to index 8 as well. In any case, you most likely want to make sure that your Analyzer is not just throwing your numbers out. This may still be up to date: http://www.jguru.com/faq/view.jsp?EID=538308 See also:

Re: Is this a bug or a feature with addIndexes?

2004-12-06 Thread Otis Gospodnetic
Hello, Try changing IndexWriter's mergeFactor variable. It's 10 by default. Change it to 1, for instance. Otis --- [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Greetings, Ok, so maybe this is common knowledge to most of you but I'm a layman when it comes to Lucene and I couldn't find any

Re: restricting search result

2004-12-03 Thread Otis Gospodnetic
This is entirely application-specific. As the simplest approach, you can index each user's documents in a separate index and use (Parallel)MultiSearcher to search appropriate indices (which ones are appropriate to search has to be a part of your app's access control logic). Otis --- Paul

Re: Recommended values for mergeFactor, minMergeDocs, maxMergeDocs

2004-12-03 Thread Otis Gospodnetic
In my experiments with mergeFactor I found the point of diminishing/no returns. If I remember correctly, I hit the limit at mergeFactor of 50. But here is something from Lucene in Action that you can use to play with various index tuning factors and see their effect on indexing performance.

Re: Document-Map, Hits-List

2004-12-03 Thread Otis Gospodnetic
Yes, it's not wise to just pull all Document instances from Hits instance, unless you really need them all. I don't do that, I really just provide a wrapper, like this: /** * A simple List implementation wrapping a Hits object. * * @author Otis Gospodnetic * @version $Id: HitList.java,v 1.4
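The wrapper source above is cut off, so the following is only a sketch of the general idea and not the HitList class from the message. `LazyHitList` and the `HitSource` interface are hypothetical; `HitSource` mirrors the two calls such a wrapper needs, a length and a by-index lookup.

```java
import java.util.AbstractList;

/** Read-only List view that fetches a document only when it is asked for. */
final class LazyHitList<D> extends AbstractList<D> {
    /** Hypothetical stand-in for a search result object. */
    interface HitSource<D> {
        int length();
        D doc(int index);
    }

    private final HitSource<D> hits;

    LazyHitList(HitSource<D> hits) {
        this.hits = hits;
    }

    @Override
    public int size() {
        return hits.length();
    }

    @Override
    public D get(int index) {
        if (index < 0 || index >= size()) {
            throw new IndexOutOfBoundsException("index: " + index);
        }
        return hits.doc(index); // nothing is loaded until this call
    }
}
```

Because `AbstractList` is unmodifiable by default, callers get the familiar List interface for paging and iteration without the wrapper ever pulling documents they did not request.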

Re: IndexWriter.optimize and memory usage

2004-12-02 Thread Otis Gospodnetic
Hello and quick answers: See IndexWriter javadoc and in particular mergeFactor, minMergeDocs, and maxMergeDocs. This will let you control the size of your segments, the frequency of segment merges, the amount of buffered Documents in RAM between segment merges and such. Also, you ask about

Document-Map, Hits-List

2004-12-01 Thread Otis Gospodnetic
This is very similar to what I do - I create a List of Maps from Hits and its Documents. So I think this change may be handy, if doable (I didn't look into changing the two Lucene classes, actually). Otis --- petite_abeille [EMAIL PROTECTED] wrote: On Dec 01, 2004, at 13:37, Karthik N S

Re: What is the best file system for Lucene?

2004-11-30 Thread Otis Gospodnetic
Hello, Lucene indexing completes in 13-15 hours on the desktop system while it completes in about 29-33 hours on the notebook. Now, combine it with the DROP INDEX tests completing in the same amount of time on both and find out why is the search only slightly faster :) Until then, all

Re: similarity matrix - more clear

2004-11-30 Thread Otis Gospodnetic
Hello, I don't think Lucene can spit out the similarity matrix for you, but perhaps you can use Lucene's Term Vector support to help you build the matrix yourself: http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/TermFreqVector.html The other relevant sections of the Lucene API

Re: Does QueryParser uses Analyzer ?

2004-11-30 Thread Otis Gospodnetic
QueryParser does use Analyzer, see this: static public Query parse(String query, String field, Analyzer analyzer) throws ParseException { QueryParser parser = new QueryParser(field, analyzer); return parser.parse(query); } Otis P.S. Use lucene-user list, please. --- Ricardo
