RE: new Lucene release: 1.2 RC2

2001-10-22 Thread Doug Cutting
From: Sunil Zanjad [mailto:[EMAIL PROTECTED]] Indexes left in an inconsistent state on crash (I don't remember who reported it). I believe that even I have reported it. This happens on abrupt exit of the JVM. To do this I had one thread updating a directory containing many .txt files and

RE: Context specific summary with the search term

2001-10-22 Thread Doug Cutting
From: Lee Mallabone [mailto:[EMAIL PROTECTED]] I'm trying to implement this and should be able to contribute any successful results, but I need to produce context on a per-field basis. E.g. if I got a token hit in the text body of a document, but the first hit token was a word in the section

RE: Querying an exact string match ?

2001-10-31 Thread Doug Cutting
This should work. You should be able to find an un-tokenized field containing spaces with a TermQuery. Nothing should ever tokenize the string. Can you please supply a simple, self-contained example showing that this does not work? Thanks, Doug -Original Message- From: Winton
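A minimal sketch of the kind of self-contained test case asked for here, assuming the Lucene 1.x API of the period (RAMDirectory, Field.Keyword, TermQuery); the field name and value are illustrative only:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.store.RAMDirectory;

    public class ExactMatchTest {
      public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
        Document doc = new Document();
        // Field.Keyword stores the value un-tokenized, spaces and all.
        doc.add(Field.Keyword("title", "New York City"));
        writer.addDocument(doc);
        writer.close();

        IndexSearcher searcher = new IndexSearcher(dir);
        // A TermQuery bypasses the analyzer, so the exact string must match.
        Hits hits = searcher.search(new TermQuery(new Term("title", "New York City")));
        System.out.println("hits: " + hits.length()); // expected: 1
        searcher.close();
      }
    }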

RE: Problems with prohibited BooleanQueries

2001-11-01 Thread Doug Cutting
From: Scott Ganyo [mailto:[EMAIL PROTECTED]] How difficult would it be to get BooleanQuery to do a standalone NOT, do you suppose? That would be very useful in my case. It would not be that difficult, but it would make queries slow. All terms not containing a term would need to be

RE: Do range queries work?

2001-11-01 Thread Doug Cutting
Can folks please try to include complete, self-contained test cases when submitting bugs? It's not that hard, and makes it much easier to figure out what is going on. For example, I have attached a complete, self-contained test case for the bug reported below. It only took 50 lines.

RE: Do range queries work?

2001-11-01 Thread Doug Cutting
From: Paul Friedman [mailto:[EMAIL PROTECTED]] It looks like there is a bug (besides the StandardAnalyzer parsing 20-35 as a single term). The query in your example: search(searcher, analyzer, FirstName:[a-k]); is not finding the correct document. It is finding doc2, it

RE: Memory Usage?

2001-11-12 Thread Doug Cutting
org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:114) org.apache.lucene.index.TermInfosReader.scanEnum(TermInfosReader.java:166) I've attached the whole trace as gzipped.txt. Regards, Anders Nielsen -Original Message- From: Doug Cutting [mailto:[EMAIL PROTECTED]] Sent: 10 November 2001 04:35

RE: Memory Usage?

2001-11-12 Thread Doug Cutting
From: Anders Nielsen [mailto:[EMAIL PROTECTED]] hmm, I seem to be getting a different number of hits when I use the files you sent out. Please provide more information! Is it larger or smaller than before? By how much? What differences show up in the hits? That's a terrible bug

RE: Sorting Options for Query Results

2001-11-19 Thread Doug Cutting
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] I think this still works if the document number continues to increase by one when documents are added incrementally. Does anyone know if this is true (I haven't looked at the code yet). Yes, that is true, so long as you do not delete

RE: Attribute Search

2001-11-26 Thread Doug Cutting
From: New, Cecil (GEAE) [mailto:[EMAIL PROTECTED]] this is exactly what I was doing. Store=false, index=true, and token=false. It appeared to work ok, but searches *never* returned any hits. That's why I suspect it is a bug. If you think this is a bug, please submit a test case, as

RE: IndexReader and IndexWriter on the same index

2001-11-27 Thread Doug Cutting
If you are performing additions and deletions then you should serially create an IndexReader to do deletions, close it, then create an IndexWriter to do additions, close it, and so on. Note that typically one will use a different IndexReader for deletions than is used for searching, so that
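A sketch of the serialized delete-then-add cycle described above, assuming the Lucene 1.x API (IndexReader.delete(Term), IndexWriter on an existing index); the method and argument names are illustrative:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    public class UpdateCycle {
      // One update pass: delete old versions first, then add the new ones.
      public static void update(String index, Term[] toDelete, Document[] toAdd) throws Exception {
        IndexReader reader = IndexReader.open(index);   // deletions go through a reader
        for (int i = 0; i < toDelete.length; i++) {
          reader.delete(toDelete[i]);
        }
        reader.close();                                 // release the write lock

        IndexWriter writer = new IndexWriter(index, new StandardAnalyzer(), false);
        for (int i = 0; i < toAdd.length; i++) {
          writer.addDocument(toAdd[i]);                 // additions go through a writer
        }
        writer.close();
      }
    }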

RE: Parallelising a query...

2001-11-29 Thread Doug Cutting
From: Winton Davies [mailto:[EMAIL PROTECTED]] I have 4 million documents... I could: Split these into 4 x 1 million document indexes and then send a query to 4 Lucene processes ? At the end I would have to sort the results by relevance. Question for Doug or any other
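One way to search several per-shard indexes and still get a single relevance-ranked result list is a MultiSearcher; a sketch assuming four pre-built indexes (the paths are illustrative):

    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.MultiSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.Searchable;

    public class SplitIndexSearch {
      public static Hits search(Query query) throws Exception {
        // Four separately built indexes of ~1M documents each (paths are illustrative).
        Searchable[] shards = {
          new IndexSearcher("/index/part1"),
          new IndexSearcher("/index/part2"),
          new IndexSearcher("/index/part3"),
          new IndexSearcher("/index/part4")
        };
        // MultiSearcher merges hits from all shards into one relevance-ranked list.
        return new MultiSearcher(shards).search(query);
      }
    }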

RE: Transactional Indexing

2001-11-29 Thread Doug Cutting
From: New, Cecil (GEAE) [mailto:[EMAIL PROTECTED]] I have noticed that when I kill/interrupt an indexing process, that it leaves a lock file, preventing further indexing. This raises a couple of questions: a. When I simply delete the file and restart the indexing, it seems to work. Is

RE: Parallelising a query...

2001-11-29 Thread Doug Cutting
TermDocs are ordered by document number. It would not be easy to change this. Doug -Original Message- From: Winton Davies [mailto:[EMAIL PROTECTED]] Sent: Thursday, November 29, 2001 11:12 AM To: Lucene Users List Subject: Re: Parallelising a query... Hi again

RE: Does Lucene really work with Java 1.1.8

2001-10-09 Thread Doug Cutting
From: Brook, James [mailto:[EMAIL PROTECTED]] I am trying to use the 'lucene-1.2-rc1.jar' with a WebObjects 4.5 application, but having problems. WebObjects uses Java 1.1.8. I read on the jGuru Lucene FAQ that Lucene should work with this version of Java. Is this correct? It should,

RE: File Handles issue

2001-10-11 Thread Doug Cutting
From: Scott Ganyo [mailto:[EMAIL PROTECTED]] We're having a heck of a time with too many file handles around here. When we create large indexes, we often get thousands of temporary files in a given index! Thousands, eh? That seems high. The maximum number of segments should be

RE: File Handles issue

2001-10-15 Thread Doug Cutting
From: Scott Ganyo [mailto:[EMAIL PROTECTED]] Thanks for the detailed information, Doug! That helps a lot. Based on what you've said and on taking a closer look at the code, it looks like by setting mergeFactor and maxMergeDocs to Integer.MAX_VALUE, an entire index will be built in a
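In the Lucene 1.x releases being discussed, mergeFactor and maxMergeDocs were (as I recall) public fields on IndexWriter; a sketch of tuning them, with illustrative values rather than the Integer.MAX_VALUE extreme mentioned above:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    public class TuningExample {
      public static IndexWriter openWriter(String path) throws Exception {
        IndexWriter writer = new IndexWriter(path, new StandardAnalyzer(), true);
        writer.mergeFactor = 50;        // merge less often: fewer, larger merges, more open files
        writer.maxMergeDocs = 1000000;  // cap segment size so merges stay bounded
        return writer;
      }
    }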

RE: number of terms vs. number of fields

2001-12-03 Thread Doug Cutting
Lucene counts the same string in different fields as a different term. In other words, a term is composed of a field and a string. Doug -Original Message- From: Joshua O'Madadhain [mailto:[EMAIL PROTECTED]] Sent: Saturday, December 01, 2001 6:55 PM To: [EMAIL PROTECTED] Subject:

RE: prefix query with multiple words

2001-12-04 Thread Doug Cutting
In short, this is not currently supported, but might be someday. For more details, see my recent response to a message with subject RE: Near without slop. Doug -Original Message- From: Tom Barrett [mailto:[EMAIL PROTECTED]] Sent: Monday, December 03, 2001 3:42 PM To: [EMAIL

RE: Near without slop

2001-12-04 Thread Doug Cutting
From: Paddy Clark [mailto:[EMAIL PROTECTED]] My current NEAR solution is to modify the query parser to build a PhraseQuery from the terms surrounding NEAR and set the slop correctly. This works for this kind of query: Bob NEAR Jim The problem comes when I try microsoft NEAR app*
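A sketch of the NEAR-via-sloppy-PhraseQuery approach described above, for the plain-terms case (the prefix case, "microsoft NEAR app*", is exactly the part this does not cover); field and parameter names are illustrative:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.PhraseQuery;

    public class NearQueryExample {
      // "bob NEAR jim": both terms must occur within `slop` positions of each other.
      public static PhraseQuery near(String field, String a, String b, int slop) {
        PhraseQuery query = new PhraseQuery();
        query.add(new Term(field, a));
        query.add(new Term(field, b));
        query.setSlop(slop);   // 0 = exact phrase; larger values allow words in between
        return query;
      }
    }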

RE: Industry Use of Lucene?

2001-12-07 Thread Doug Cutting
Kelvin, I don't see "Powered by Lucene" on your results pages: http://www.relevanz.com/Search?query=media If you add this, we can add you to the Powered by Lucene page: http://jakarta.apache.org/lucene/docs/powered.html What other sites should be added to this page? Doug -Original

RE: Term ordering for IndexReader.termDocs()

2002-01-25 Thread Doug Cutting
From: Ype Kingma [mailto:[EMAIL PROTECTED]] I'm creating a filter from a set of terms that are read from a file, and I find that IndexReader.termDocs(Term(fieldName, valueFromFile)) does this quite well (around 0.1 secs elapsed time in jython code.) Would it be advantageous to sort the

RE: strange search problems (cannot query for more than the first 10000 words!?!)

2002-01-28 Thread Doug Cutting
From: Karl Øie [mailto:[EMAIL PROTECTED]] I have created a test class for working with Analyzers and ran into a strange problem; I cannot search for text in fields with more than 10000 words!?!? Lucene by default stops indexing after the 10,000th token. See

release 1.2 RC3

2002-01-28 Thread Doug Cutting
A new release of Lucene is available, 1.2 release candidate 3. The new release can be downloaded from: http://jakarta.apache.org/builds/jakarta-lucene/release/v1.2-rc3/ If no major problems are identified in the next few days, we will make a 1.2 final release--the first final release since

RE: Moving Index from Crawl/Build Server to Search Server

2002-01-31 Thread Doug Cutting
From: Mark Tucker [mailto:[EMAIL PROTECTED]] What is the best way to move the index from the build server to the search servers and then change which index a user is searching against? I am concerned about switching the index while a user is paging through search results. Ideally

RE: Obtaining all results efficiently. Closing a searcher.

2002-01-31 Thread Doug Cutting
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] Are you implying ( ... public synchronized Searcher getSearcher()) to use this synchronized method in a servlet/jsp thread as well? Yes. Your jhtml example doesn't appear to be synchronized. Maybe I'm missing something though.

RE: Indexing and Searching happening together

2002-02-01 Thread Doug Cutting
From: Kelvin Tan [mailto:[EMAIL PROTECTED]] True (and it's great) that once an IndexReader is open, no actions on the IndexWriter affect it. However, if an IndexReader is opened _after_ indexing begins, I suppose it'll throw an exception? Doesn't it mean that when indexing is taking

RE: PhraseQuery: NullPointerException

2002-02-08 Thread Doug Cutting
This bug has been fixed. The fix will be in tonight's nightly build. Doug -- To unsubscribe, e-mail: mailto:[EMAIL PROTECTED] For additional commands, e-mail: mailto:[EMAIL PROTECTED]

RE: problems with last patch (obtain write.lock while deleting documents)

2002-02-10 Thread Doug Cutting
From: Daniel Calvo [mailto:[EMAIL PROTECTED]] I've just updated my version (via CVS) and now I'm having problems with document deletion. I'm trying to delete a document using IndexReader's delete(Term) method and I'm getting an IOException: java.io.IOException: Index locked for write:

RE: PrefixQuery Scoring

2002-02-13 Thread Doug Cutting
From: Jonathan Franzone [mailto:[EMAIL PROTECTED]] Whenever I add a PrefixQuery to my search the scoring gets really small. For example if I do a query like this: +java then the scoring starts around 0.866... and so forth. But if I do a query like this: +java* then the scoring start

RE: using lucene with a very large index

2002-02-14 Thread Doug Cutting
From: tal blum [mailto:[EMAIL PROTECTED]] 2) Does the Document id change after merging indexes, adding or deleting documents? Yes. 4) assuming I have a term query that has a large number of hits, say 10 million, is there a way to get, say, the top 10 results without going through

RE: write.lock file

2002-02-14 Thread Doug Cutting
I cannot replicate the problem you are having. Can you please submit a complete, self-contained, test case illustrating the problem you are having with the write lock. Please test this against the latest nightly build of Lucene, from: http://jakarta.apache.org/builds/jakarta-lucene/nightly/

RE: Lucene Query Structure

2002-02-19 Thread Doug Cutting
From: Joshua O'Madadhain [mailto:[EMAIL PROTECTED]] After considerable study of the documentation, I am still confused about the semantics of BooleanQuery. Now, as sjb pointed out, (query, false, false) doesn't really seem to have the semantics of a boolean OR. In fact, it does. In

RE: Qs re: document scoring and semantics

2002-02-19 Thread Doug Cutting
From: Joshua O'Madadhain [mailto:[EMAIL PROTECTED]] Is either of the expressions below the correct parenthesization of the expression above? If not, what is? score_d = sum_t(tf_q * (idf_t / norm_q) * tf_d * (idf_t / norm_d_t) * boost_t) * coord_q_d That's correct. The tf*idf weights
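Written out with explicit grouping (LaTeX, same notation as the message quotes), the parenthesization confirmed above is:

    \mathrm{score}_d = \Bigl(\sum_{t \in q} \mathrm{tf}_q \cdot \frac{\mathrm{idf}_t}{\mathrm{norm}_q} \cdot \mathrm{tf}_d \cdot \frac{\mathrm{idf}_t}{\mathrm{norm}_{d,t}} \cdot \mathrm{boost}_t\Bigr) \cdot \mathrm{coord}_{q,d}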

RE: Googlifying lucene querys

2002-02-25 Thread Doug Cutting
If you put the title in a separate field from the contents, and search both fields, matches in the title will usually be stronger, without explicit boosting. This is because the scores are normalized by the length of the field, and the title tends to be much shorter than the contents. So even
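A sketch of the two-field query described above, assuming the 1.x-era QueryParser.parse and BooleanQuery.add(query, required, prohibited) signatures; the field names "title" and "contents" are illustrative:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;

    public class TwoFieldQuery {
      // Search both fields; title matches score higher simply because titles are shorter.
      public static Query build(String userQuery) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        Query title = QueryParser.parse(userQuery, "title", analyzer);
        Query contents = QueryParser.parse(userQuery, "contents", analyzer);
        BooleanQuery both = new BooleanQuery();
        both.add(title, false, false);     // optional clause
        both.add(contents, false, false);  // optional clause
        return both;
      }
    }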

RE: Googlifying lucene querys

2002-02-25 Thread Doug Cutting
From: Joshua O'Madadhain [mailto:[EMAIL PROTECTED]] You cannot, in general, structure a Lucene query such that it will yield the same document rankings that Google would for that (query, document set). The reason for this is that Google employs a scoring algorithm that includes

RE: Boolean Query Parsing with IN keyword

2002-02-26 Thread Doug Cutting
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]] But, StandardAnalyzer is no longer final (get the latest build) and you can write a class that subclasses it Right. To flesh out Otis' example of how to change StandardAnalyzer's stop list by defining a subclass of it: public class
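The example is cut off in the excerpt above; a reconstruction along the lines described, assuming StandardAnalyzer's (String[]) constructor and using an obviously illustrative stop list:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    public class MyAnalyzer extends StandardAnalyzer {
      // An illustrative stop list; substitute whatever words you want removed.
      private static final String[] STOP_WORDS = { "a", "an", "and", "of", "the" };

      public MyAnalyzer() {
        super(STOP_WORDS);   // pass the custom stop list to StandardAnalyzer
      }
    }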

RE: Wildcard Searching

2002-02-27 Thread Doug Cutting
From: Howk, Michael [mailto:[EMAIL PROTECTED]] Also, Lucene returns the parsed version of each of our searches. When we search by rou*d, Lucene parses it as rou*d (which is what we would expect). But when we search by rou?d, Lucene parses it as rou d. It seems to wrap the term in

RE: Optimization and deletes

2002-02-28 Thread Doug Cutting
From: Aruna Raghavan [mailto:[EMAIL PROTECTED]] I have noticed that unless I optimize the index while adding documents to it, the deleted documents are not getting physically deleted right away (even though they seemed to have been flagged as deleted). The searcher could not find

RE: corrupted index

2002-04-02 Thread Doug Cutting
Hinrich, Can you please send a stack trace? As others have mentioned, there isn't an index integrity checker. Doug P.S. Hi! How are you? -Original Message- From: H S [mailto:[EMAIL PROTECTED]] Sent: Monday, April 01, 2002 5:26 PM To: [EMAIL PROTECTED] Subject: corrupted index

RE: QueryParser question - case-sensitivity

2002-05-09 Thread Doug Cutting
[I'm resending this from a different account, since my first attempt is bogged down somewhere. A second copy will probably show up tomorrow, but in the interests of solving this problem sooner, I'm resending it. Sorry for the duplication.] Define an Analyzer that does not lowercase the id

Re: Weighted index

2002-06-24 Thread Doug Cutting
Peter Carlson wrote: I don't know the actual algorithm, but when you type in the search title:hello^3 AND heading:dolly^4 will produce different document scores than title:hello AND heading:dolly^4 Lucene will get the score for a given document, not a field. So it does combine the

Re: Crash / Recovery Scenario

2002-07-10 Thread Doug Cutting
Karl Øie wrote: If a crash happens during writing there is no good way to know if the index is intact; removing lock files doesn't help this fact, as we really don't know. So providing rollback functionality is a good but expensive way of compensating for lack of recovery. The

Re: Crash / Recovery Scenario

2002-07-10 Thread Doug Cutting
Karl Øie wrote: A better solution would be to hack the FSDirectory to store each file it would store in a file-directory as a serialized byte array in a blob of a sql table. This would increase performance because the whole Directory doesn't have to change each time, and it doesn't have to

Re: CachedSearcher

2002-07-15 Thread Doug Cutting
Halácsy Péter wrote: A lot of people requested code to cache opened Searcher objects as long as the index is not modified. The first version of this was written by Scott Ganyo and submitted as IndexAccessControl to the list. Now I've decoupled the logic that is needed to manage searchers.

Re: CachedSearcher

2002-07-16 Thread Doug Cutting
Scott Ganyo wrote: I'd like to see the finalize() methods removed from Lucene entirely. In a system with heavy load and lots of gc, using finalize() causes problems. [ ... ] External resources (i.e. file handles) are not released until the reader is closed. And, as many have found,

Re: CachedSearcher

2002-07-16 Thread Doug Cutting
Hang Li wrote: Why there are so many final and package-protected methods? The package private stuff was motivated by Javadoc. When I wrote Lucene I wanted the Javadoc to make it easy to use. Thus I did not want the Javadoc cluttered with lots of methods that 99% of users did not need to

Re: CachedSearcher

2002-07-17 Thread Doug Cutting
Halácsy Péter wrote: I made an IndexReaderCache class from the code you have sent (the code in demo/Search.jhtml). But this causes an exception: IndexSearcher searcher = new IndexSearcher(cache.getReader("/data/index")); searcher.close(); searcher = new

Re: Modifying scores

2002-07-23 Thread Doug Cutting
Mike Tinnes wrote: I'm trying to implement a HITS/PageRank type algorithm and need to modify the document scores after a search is performed. The final score will be a combination of the lucene score and PageRank. Is there currently a way to modify the scores on the fly via HitCollector? so

Re: Numeric Support

2002-07-26 Thread Doug Cutting
Armbrust, Daniel C. wrote: I don't know what a good numbers implementation is, but the way that I do it now, with filters on the bit set after they come back just feels like a hack. Even if bit sets are very fast, it doesn't seem right to iterate over nearly the entire set of terms to filter

Re: Deleting Problem

2002-08-01 Thread Doug Cutting
Terry Steichen wrote: fine now. (I thought I read someplace that you didn't have to optimize after a delete, but if I don't, it doesn't seem to work.) You don't need to optimize after delete for search results to be correct. However IndexReader.docFreq() may be incorrect until you've

Re: Full List of Stop Words for Standard Analyzer.

2002-08-02 Thread Doug Cutting
Ian Lea wrote: In org/apache/lucene/analysis/standard/StandardAnalyzer.java. The source code for the current release is also on the website. In particular, this file is available as: http://jakarta.apache.org/lucene/src/java/org/apache/lucene/analysis/standard/StandardAnalyzer.java Doug

Re: Lucene's Ranking Function

2002-09-11 Thread Doug Cutting
Clemens Marschner wrote: 1. I think the new document boost is missing, isn't it? With that it should be something like score_d = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t * boost_t) * coord_q_d * boost_d Is that correct? Almost. This should actually be boost_d * boost_d_t,

Re: Is Lucene suitable for one-time index and one-time search ?

2002-09-21 Thread Doug Cutting
Mailing Lists Account wrote: I need to search a bunch of documents. Each document needs to be searched only once. That means once I build the index and search it, I have no need for that index and the document again. This does not sound like the problem that Lucene is designed to solve.

Re: 1.2 source jar incomplete?

2002-10-03 Thread Doug Cutting
Ype Kingma wrote: I extracted again, and found my problem: One of the extracted files is lucene-1.2-src.jar. When unzipping this you get a directory tree with only the directories mentioned. As I recall, this jar contains only those java source files that are generated by JavaCC. I don't

Re: Deleting a document found in a search

2002-10-09 Thread Doug Cutting
[EMAIL PROTECTED] wrote: My first thought is to define a Field.Keyword(composite-key, domain + \u + id). This would allow me to use the delete(Term) interface to delete the key. That sounds like a good way to solve this. You could also use a HitCollector with a Query, but I think the
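A sketch of the composite-key approach endorsed above. The "\u..." escape in the excerpt is truncated, so the NUL separator used here is an assumption, and the field name and helper are illustrative:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;

    public class CompositeKey {
      // Build the key the same way at index time and at delete time.
      // The "\0" separator is illustrative; any character that cannot appear
      // in domain or id will do.
      static String key(String domain, String id) {
        return domain + "\0" + id;
      }

      public static void addKey(Document doc, String domain, String id) {
        doc.add(Field.Keyword("composite-key", key(domain, id)));
      }

      public static void deleteByKey(String index, String domain, String id) throws Exception {
        IndexReader reader = IndexReader.open(index);
        reader.delete(new Term("composite-key", key(domain, id)));
        reader.close();
      }
    }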

Re: Question: using boost for sorting

2002-10-16 Thread Doug Cutting
This looks like a good approach. When I get a chance, I'd like to make Similarity an interface or an abstract class, whose default implementation would do what the current class does, but whose methods can be overridden. Then I'd add methods like: public static void

Re: retrieving term positions during the search process

2002-10-16 Thread Doug Cutting
Stephan Grimm wrote: Is there a way to retrieve the original term positions during the search process invoked by Searcher.search()? In addition to the documents and their scores we want to have access to the positions of the terms found in order to do a highlighting. We don't want to perform

Re: Enabling URL-based read access to the search index

2002-10-16 Thread Doug Cutting
Schaeffer, David wrote: I am planning to upgrade from Lucene 1.0 to Jakarta Lucene 1.2. My current implementation uses Jason Pell's URLDirectory class so that Lucene can access the search index while running in an applet. I modified IndexReader.java to use URLDirectory instead of

Re: hit scoring on latest build

2002-11-04 Thread Doug Cutting
If you check the CHANGES file for changes made since the 1.2 release, you'll find: Added support for boosting the score of documents and fields via the new methods Document.setBoost(float) and Field.setBoost(float). Note: This changes the encoding of an indexed value. Indexes should
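A sketch of the setBoost calls named in that CHANGES entry; the field names and boost values are illustrative:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class BoostAtIndexTime {
      public static Document makeDoc(String title, String body, boolean important) {
        Document doc = new Document();
        Field titleField = Field.Text("title", title);
        titleField.setBoost(2.0f);          // weight title matches more heavily
        doc.add(titleField);
        doc.add(Field.Text("body", body));
        if (important) {
          doc.setBoost(1.5f);               // boost the whole document
        }
        return doc;
      }
    }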

Re: Several fields with the same name

2002-11-06 Thread Doug Cutting
Right. Use the fields() iterator to scan for multiple Field instances with the same name(). Doug Rob Outar wrote: Would the solution be to call Document.fields(), iterate through that enum and get my data? Thanks, Rob -Original Message- From: Rob Outar
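A sketch of that fields() scan, collecting every value stored under one name (the field name and helper are illustrative; Document.get() only hands back a single value):

    import java.util.Enumeration;
    import java.util.Vector;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class MultiValuedField {
      // Collect all values of a repeated field by scanning every Field instance.
      public static String[] valuesOf(Document doc, String name) {
        Vector values = new Vector();
        Enumeration fields = doc.fields();
        while (fields.hasMoreElements()) {
          Field field = (Field) fields.nextElement();
          if (field.name().equals(name)) {
            values.addElement(field.stringValue());
          }
        }
        String[] result = new String[values.size()];
        values.copyInto(result);
        return result;
      }
    }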

Re: Deleting fields from a Document

2002-11-12 Thread Doug Cutting
Kelvin Tan wrote: Does an in-memory Field guarantee access to its name and value? Say I retrieve a Field from a Document A, and add it to a new Document B. Before writing B to the index, I delete A. Would B still contain the Field? If so, does it work for both String-based and Reader-based

Re: has this exception been seen before

2002-11-12 Thread Doug Cutting
A self-contained, reproducible test case is required before someone can really start looking at it. What is the history of this index? Have attempts to update it ever failed prior to this? Doug Avi Drissman wrote: At 8:56 AM -0400 9/20/02, you wrote: Because of this problem, this issue

Re: Mushrooming Index Files

2002-11-12 Thread Doug Cutting
My guess is that you have around 40 fields. Each field requires a separate file in each segment. Can you combine any of your fields? Terry Steichen wrote: I need to modify my original issue below. I was in error - the optimization does indeed bring the total number of index files back to 46.

Re: Searching Ranges

2002-11-12 Thread Doug Cutting
Isn't the break on line 162 of RangeQuery.java supposed to achieve this? Alex Winston wrote: otis, i was able to fix the junit build problems, with the newest versions of ant in regards to lucene unit tests. it appears that the junit.jar must appear in the $ANT_HOME/lib dir in order to run

Re: How to get all field names

2002-11-12 Thread Doug Cutting
This would not be hard to implement. It would take something like: public abstract String[] IndexReader.getFieldNames(); This would need to be implemented in two classes, SegmentReader and SegmentsReader. The former would just access its fieldInfos field to list fields. The latter would
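The patch sketched above needs changes inside SegmentReader and SegmentsReader; until something like it exists, a user-level workaround is to walk the term enumeration and record each distinct field. This only sees indexed fields and touches every term, so it is a stopgap, not the proposed getFieldNames():

    import java.util.Enumeration;
    import java.util.Hashtable;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.TermEnum;

    public class FieldNames {
      // Walk all terms and collect each distinct field name (indexed fields only).
      public static String[] indexedFieldNames(IndexReader reader) throws Exception {
        Hashtable names = new Hashtable();
        TermEnum terms = reader.terms();
        while (terms.next()) {
          names.put(terms.term().field(), Boolean.TRUE);
        }
        terms.close();

        String[] result = new String[names.size()];
        int i = 0;
        for (Enumeration e = names.keys(); e.hasMoreElements(); ) {
          result[i++] = (String) e.nextElement();
        }
        return result;
      }
    }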

Re: Stress/scalability testing Lucene

2002-11-20 Thread Doug Cutting
and writing at the same time? I thought I read this in the FAQ. Roy. -Original Message- From: Doug Cutting [mailto:[EMAIL PROTECTED]] Sent: Wednesday, November 20, 2002 5:04 PM To: Lucene Users List Subject: Re: Stress/scalability testing Lucene * Replies will be sent through Spamex to [EMAIL

Re: Updating documents

2002-11-22 Thread Doug Cutting
A deletion is only visible in other IndexReader instances created after the IndexReader where you made the deletion is closed. So if you're searching using a different IndexReader, you need to re-open it after the deleting IndexReader is closed. The lastModified method helps you to figure
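A sketch of the re-open logic described, using IndexReader.lastModified() to detect a changed index; class and variable names are illustrative, and a real cache would also keep the old reader open until in-flight searches finish:

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.IndexSearcher;

    public class SearcherRefresher {
      private String indexPath;
      private IndexReader reader;
      private long version;

      public SearcherRefresher(String indexPath) throws Exception {
        this.indexPath = indexPath;
        this.reader = IndexReader.open(indexPath);
        this.version = IndexReader.lastModified(indexPath);
      }

      // Re-open the reader only if the index changed since it was last opened.
      public synchronized IndexSearcher getSearcher() throws Exception {
        long current = IndexReader.lastModified(indexPath);
        if (current != version) {
          reader.close();                      // see caveat in the lead-in
          reader = IndexReader.open(indexPath);
          version = current;
        }
        return new IndexSearcher(reader);
      }
    }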

Re: How does delete work?

2002-11-22 Thread Doug Cutting
mergeFactor? --- Doug Cutting [EMAIL PROTECTED] wrote: The data is actually removed the next time its segment is merged. Optimizing forces it to happen, but it will also eventually happen as more documents are added to the index, without optimization. Scott Ganyo wrote: It just marks the record

Re: How does delete work?

2002-11-22 Thread Doug Cutting
. Is this right? Thanks, Otis --- Doug Cutting [EMAIL PROTECTED] wrote: Merging happens constantly as documents are added. Each document is initially added in its own segment, and pushed onto the segment stack. Whenever there are mergeFactor segments on the top of the stack that are the same size

Re: How does delete work?

2002-11-23 Thread Doug Cutting
Clemens Marschner wrote: So what if documents are deleted in the meantime? Then the recursive merge can't determine the X segments with the same size. If you read my previous message you'll find the answer: Doug Cutting wrote: It's actually a little more complicated than that, since (among

Re: Keyword fields which don't contribute to a document's score?

2002-12-06 Thread Doug Cutting
In the pre-release version available in the nightly builds you can boost document fields at index time. Check out the CHANGES.txt file for details. Doug Ashley Collins wrote: Is it possible to stop keyword fields contributing to a document's score? Leaving only text fields? Is the best way

Re: Indexing in a CBD Environment

2002-12-10 Thread Doug Cutting
I'm not sure I understand the question, but I'll hazard an answer anyway. Might it work to maintain separate indexes for B, C, E and F, then use a MultiSearcher to search them all? That would keep updates local... Doug Cohan, Sean wrote: I am a total newbie to Lucene. We are developing

Re: Empty phrase search

2002-12-17 Thread Doug Cutting
I believe that the underlying search and indexing code should correctly handle terms with zero-length text, although I have never tested this. However I know of no query parser syntax to generate such terms in a query. But it should work to use them in a manually constructed query. Doug Minh

Re: write.lock file

2002-12-17 Thread Doug Cutting
Sale, Doug wrote: it depends on what you mean by corrupt. i think there are 3 cases: 1) the process died during a non-writing action (woo-hoo!) 2) the process died during a user-writing action (building a document) 3) the process died during a system-writing action (writing an index file) i

Re: Lucene Benchmarks and Information

2002-12-20 Thread Doug Cutting
Armbrust, Daniel C. wrote: While I was trying to build this index, the biggest limitation of Lucene that I ran into was optimization. Optimization kills the indexer's performance when you get between 3-5 million documents in an index. On my Windows XP box, I had to reoptimize every 100,000

Re: write.lock file

2002-12-20 Thread Doug Cutting
petite_abeille wrote: On Tuesday, Dec 17, 2002, at 17:43 Europe/Zurich, Doug Cutting wrote: Index updates are atomic, so it is very unlikely that the index is corrupted, unless the underlying file system itself is corrupted. Ummm... Perhaps in theory... In practice, indexes seems to get

Re: Lucene Benchmarks and Information

2002-12-20 Thread Doug Cutting
petite_abeille wrote: On Friday, Dec 20, 2002, at 19:58 Europe/Zurich, Scott Ganyo wrote: FYI: The best thing I've found for both increasing speed and reducing file handles is to use an IndexWriter on a RAMDirectory for indexing and then use IndexWriter.addIndexes() to write the result to
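A sketch of that batch-in-RAM pattern under the same 1.x API assumptions; the path, method name, and batch handling are illustrative:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;

    public class RamBatchIndexer {
      // Index a batch in memory, then merge it into the on-disk index in one step.
      public static void indexBatch(String diskPath, Document[] batch) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();

        RAMDirectory ramDir = new RAMDirectory();
        IndexWriter ramWriter = new IndexWriter(ramDir, analyzer, true);
        for (int i = 0; i < batch.length; i++) {
          ramWriter.addDocument(batch[i]);
        }
        ramWriter.close();

        IndexWriter diskWriter = new IndexWriter(diskPath, analyzer, false);
        diskWriter.addIndexes(new Directory[] { ramDir });  // single merge onto disk
        diskWriter.close();
      }
    }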

Re: How to obtain unique field values

2002-12-30 Thread Doug Cutting
Erik Hatcher wrote: Is it possible for me to retrieve all the values of a particular field that exists within an index, across all documents? For example, I'm indexing documents that have a category associated with them. Several documents will share the same category. I'd like to be able to

Re: Incomprehensible (to me) tokenizing behavior

2002-12-30 Thread Doug Cutting
Terry Steichen wrote: I tested StandardAnalyzer (which uses StandardTokenizer) by inputting a set of strings which produced the following results: aa/bb/cc/dd was tokenized into 4 terms: aa, bb, cc, dd aa/bb/cc/d1 was tokenized into 3 terms: aa, bb, cc/d1 aa/bb/c1/dd was tokenized into 2

Re: QueryParser question

2002-12-31 Thread Doug Cutting
Erik Hatcher wrote: I'd like to revisit this issue. First, I add the path field to the Document in this way: doc.add(Field.Keyword("path", path)); This field is, of course, not tokenized by the Analyzer, right? So shouldn't QueryParser take this fact into account on a field-by-field

Re: QueryParser question

2002-12-31 Thread Doug Cutting
Doug Cutting wrote: However, in most cases where this is an issue, the real problem is that folks are placing too much reliance on the query parser. The query parser is designed for user-entered queries. If you're programmatically generating query strings that are then fed to the query

Re: Optimization Question

2003-01-06 Thread Doug Cutting
It should always be safe to search an index, even while optimizing. Harpreet S Walia wrote: Hi, I am using Lucene on Windows and have the following question about optimization. Is it safe to search if an optimize process is going on? I found a reference to this in the archives which said that on

Re: Lucene and thread safety

2003-01-06 Thread Doug Cutting
Lucene is thread and process safe. An IndexReader, once opened, always reflects the same state of the index. To see changes made by another thread or process you must open a new IndexReader. Doug Joe Consumer wrote: I read a while back that Lucene is not thread safe. That was in the FAQ on

Re: Bad file descriptor?

2003-01-08 Thread Doug Cutting
My guess would be that you're using an IndexReader that has been closed. Doug petite_abeille wrote: Hello, Here is another symptom of misbehavior in Lucene: java.io.IOException: Bad file descriptor at java.io.RandomAccessFile.readBytes(Native Method) at

Re: read past EOF?

2003-01-08 Thread Doug Cutting
petite_abeille wrote: On Tuesday, Jan 7, 2003, at 22:46 Europe/Zurich, Doug Cutting wrote: This could happen if Lucene's file locking is disabled or broken. [ ... ] File locking is known to be broken over NFS, and wasn't even present in early versions of Lucene. Are you using an ordinary

Re: how to join 2 queries togther

2003-01-20 Thread Doug Cutting
Do you want hits to contain the word words or not? You've got it in both clauses... Also, +(a b c) requires that any of a b or c be in a document, but not necessarily all of them. If you want it to contain all of them then each term must be required, e.g., +a +b +c. In the latest sources
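A sketch contrasting the two forms described: +(a b c) requires at least one of the terms, while building every clause as required forces all of them. Field and method names are illustrative, using the 1.x BooleanQuery.add(query, required, prohibited) signature:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;

    public class RequiredTerms {
      // +(a b c)  -> at least one of a, b, c must match.
      // +a +b +c  -> every one of a, b, c must match; that is what this builds.
      public static BooleanQuery allOf(String field, String[] words) {
        BooleanQuery query = new BooleanQuery();
        for (int i = 0; i < words.length; i++) {
          // required = true, prohibited = false
          query.add(new TermQuery(new Term(field, words[i])), true, false);
        }
        return query;
      }
    }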

Re: Computing Relevancy Differently

2003-02-07 Thread Doug Cutting
Terry Steichen wrote: I read all the relevant references I could find in the Users (not Developers) list, and I still don't exactly know what to do. What I'd like to do is get a relevancy-based order in which (a) longer documents tend to get more weight than shorter ones, (b) a document body

Re: Computing Relevancy Differently

2003-02-10 Thread Doug Cutting
Terry Steichen wrote: Can you give me an idea of what to replace the lengthNorm() method with to, for example, remove any special weight given to shorter matching documents? The goal of the default implementation is not to give any special weight to shorter documents, but rather to remove the

Re: Phrase query and porter stemmer

2003-02-13 Thread Doug Cutting
Mailing Lists Account wrote: Doug Cutting wrote: That's because Google and most internet search engines never do any stemming. Generally speaking, are there any advantages not to apply the stemmer ? Except for certain keywords,I found use of stemmers helpful. Generally speaking, stemmers

Re: Score per Term

2003-02-24 Thread Doug Cutting
Check out the new Explanation API in the latest CVS sources. It permits one to get a detailed explanation of how a query was scored against a document. Note that these explanations are designed for user perusal, not for further computation, and are as expensive to construct as re-running the

Re: Computing Relevancy Differently

2003-02-28 Thread Doug Cutting
: Doug Cutting [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Monday, February 10, 2003 1:57 PM Subject: Re: Computing Relevancy Differently Terry Steichen wrote: Can you give me an idea of what to replace the lengthNorm() method with to, for example, remove any special weight

Re: IndexReader.delete(int) not working for me

2003-03-05 Thread Doug Cutting
Joseph Ottinger wrote: Then this means that my IndexReader.delete(i) isn't working properly. What would be the common causes for this? My log shows the documents being deleted, so something's going wrong at that point. Are you closing the IndexReader after doing the deletes? This is required for

Re: Need help in changing the search score

2003-03-11 Thread Doug Cutting
Ching-Pei Hsing wrote: Even if we boost the Name by 10 like the following query, It's still the same. query = (NAME:inn NAME:comfort NAME:shampoo)^10 (MMNUM:inn MMNUM:shampoo MMNUM:comfort) (SMNUM:shampoo SMNUM:comfort SMNUM:inn) In the 1.2 release, I don't think this sort of boosting (of a

Re: Range of Score Values?

2003-03-14 Thread Doug Cutting
Rishabh Bajpai wrote: I am getting a long value between 1 (included) and 0 (excluded, I think), and it makes sense to me logically as well - I wouldn't know what a value of greater than 1 would mean, and why should a term that has a score of 0 be returned in the first place! But just to be sure, I

Re: multiple collections indexing

2003-03-19 Thread Doug Cutting
Morus Walter wrote: Searches must be able on any combination of collections. A typical search includes ~ 40 collections. Now the question is, how to implement this in lucene best. Currently I see basically three possibilities: - create a data field containing the collection name for each document

new Lucene release: 1.3 RC1

2003-03-24 Thread Doug Cutting
There's a new Lucene release available for download. See the website for details: http://jakarta.apache.org/lucene/docs/index.html Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL

Re: Changing Field type

2003-03-27 Thread Doug Cutting
Maik Schreiber wrote: In an index I have documents with a field that has been constructed using Field.UnIndexed(). Now I want to switch to Field.Keyword() so I can search for those fields, too. Does it cause any harm if I'm mixing field types like that? I think this used to throw an exception,

Re: Where to get stopword lists?

2003-06-06 Thread Doug Cutting
Ulrich Mayring wrote: does anyone know of good stopword lists for use with Lucene? I'm interested in English and German lists. The Snowball project has good stop lists. See: http://snowball.tartarus.org/ http://snowball.tartarus.org/english/stop.txt
