Re: Query Term Questions

2004-01-21 Thread Doug Cutting
Terry Steichen wrote: 1) Is there a way to set the query boost factor depending not on the presence of a term, but on the presence of two specific terms? For example, I may want to boost the relevance of a document that contains both "iraq" and "clerics", but not boost the relevance of documents t

Re: Vector -> LinkedList for performance reasons...

2004-01-21 Thread Doug Cutting
Francesco Bellomi wrote: I agree that synchronization in Vector is a waste of time if it isn't required, It would be interesting to see if such synchronization actually impairs overall performance significantly. This would be fairly simple to test. but I'm not sure if LinkedList is a better (fas

Re: setMaxClauseCount ??

2004-01-21 Thread Doug Cutting
Andrzej Bialecki wrote: Karl Koch wrote: I actually wanted to add a large amount of text from an existing document to find a closely related one. Can you suggest another good way of doing this? You should try to reduce the dimensionality by reducing the number of unique features. In this case, you

Re: difference in javadoc and faq similarity expression

2004-01-20 Thread Doug Cutting
Nicolas Maisonneuve wrote: in the Similarity Javadoc score(q,d) =Sum [tf(t in d) * idf(t) * getBoost(t.field in d) * lengthNorm(t.field in d) * coord(q,d) * queryNorm(q) ] in the FAQ score_d = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t * boost_t) * coord_q_d In FAQ | In Javadoc 1 / no

Re: setMaxClauseCount ??

2004-01-20 Thread Doug Cutting
setMaxClauseCount determines the maximum number of clauses, which is not your problem here. Your problem is with required clauses. There may only be a total of 31 required (or prohibited) clauses in a single BooleanQuery. If you need more, then create more BooleanQueries and combine them wit
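The 31-clause limit described above could be worked around along these lines. This is a minimal sketch in plain Java, not Lucene code: it only shows how a long list of required terms might be partitioned into groups of at most 31, each of which would then back one BooleanQuery. The class and method names are hypothetical, and how the groups are combined is left out because the message is cut off at that point.

```java
import java.util.ArrayList;
import java.util.List;

final class RequiredClauseGroups {
    // Limit on required (or prohibited) clauses per BooleanQuery, as stated in the message.
    static final int MAX_REQUIRED = 31;

    // Splits the terms into consecutive groups of at most MAX_REQUIRED entries.
    static List<List<String>> partition(List<String> terms) {
        List<List<String>> groups = new ArrayList<>();
        for (int start = 0; start < terms.size(); start += MAX_REQUIRED) {
            int end = Math.min(start + MAX_REQUIRED, terms.size());
            groups.add(new ArrayList<>(terms.subList(start, end)));
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> terms = new ArrayList<>();
        for (int i = 0; i < 70; i++) {
            terms.add("term" + i);
        }
        // 70 terms end up in groups of 31, 31 and 8.
        for (List<String> group : partition(terms)) {
            System.out.println(group.size());
        }
    }
}
```

Each group stays within the limit on its own; the query that combines the groups is subject to the same limit, so this only helps up to a point.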

Re: Gettting all index fields of an index

2004-01-20 Thread Doug Cutting
Try calling IndexReader.getFieldNames(). Karl Koch wrote: How can I get a list of all fields in an index from which I know only the directory string? Karl - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-m

Re: Ordening documents

2004-01-20 Thread Doug Cutting
Yes, this is correct. Peter Keegan wrote: So they are sorted by reverse document number. Is this the 'external' document number (the one that is adjusted for the segment's base)? If so, then this means that documents with equal score are returned in the order in which they were added to the index.

Re: IndexReader.document(int i)

2004-01-20 Thread Doug Cutting
Nicolas Maisonneuve wrote: I would like to know, in IndexReader.document(int i), what is this number i? Is the first document the oldest document indexed and the last the youngest (so we can sort by date easily)? Yes, documents with lower numbers were indexed earlier. As documen

Re: mergeFactor and maxMergeDocs

2004-01-20 Thread Doug Cutting
Chong, Herb wrote: what effect and what recommendations are valid for Lucene 1.3? Same as always: use the defaults and call optimize() only when you know you won't be changing the index for a while. If you have lots of RAM, increasing minMergeDocs may increase indexing speed, but raising it too

Re: Getting word freqency?

2004-01-13 Thread Doug Cutting
[EMAIL PROTECTED] wrote: I would like to get a word frequency list from a text. How can I achieve this in the most direct way using Lucene classes? Can I do it without generating an index? No, if you want Lucene to compute frequencies, then you need to create an index. Doug

new release: 1.3 final

2003-12-26 Thread Doug Cutting
A new Lucene release is available. It can be downloaded from: http://cvs.apache.org/dist/jakarta/lucene/v1.3-final/ Release notes are at: http://cvs.apache.org/viewcvs.cgi/*checkout*/jakarta-lucene/CHANGES.txt?rev=1.65 Happy Holidays! Doug

Re: Sentence Endings: IndexWriter.maxFieldLength and Token.setPositionIncrement()

2003-12-19 Thread Doug Cutting
Jochen, Someone else recently made a similar, reasonable complaint. I agree that this should be fixed. The fastest way to get it fixed would be to submit a patch to lucene-dev, with a test case, etc. Doug Jochen Frey wrote: Hi! I hope this is the right forum for this post. I was wondering

Re: best way of reusing IndexSearcher objects

2003-12-18 Thread Doug Cutting
Doug Cutting wrote: That's true. If you're doing updates (as opposed to just additions) then you probably want to do something like: 1. keep a single open IndexReader used by all searches 2. Every few minutes, process updates as follows: a. open a second IndexReader b.
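The idea of one shared reader that is periodically replaced can be sketched as follows. This is plain Java with a generic holder rather than Lucene code: the type parameter stands in for an IndexReader or IndexSearcher, the class name is hypothetical, and only the switch itself is shown because the steps after "b." are cut off in the message.

```java
import java.util.concurrent.atomic.AtomicReference;

// Holds the instance that all searches currently use and lets an
// updater thread replace it in one atomic step.
final class SharedInstanceHolder<T> {
    private final AtomicReference<T> current;

    SharedInstanceHolder(T initial) {
        current = new AtomicReference<>(initial);
    }

    // Called by search threads: always returns a fully prepared instance.
    T get() {
        return current.get();
    }

    // Called by the updater once the new instance is ready.
    // Returns the previous instance so the caller can release it later.
    T swap(T fresh) {
        return current.getAndSet(fresh);
    }

    public static void main(String[] args) {
        SharedInstanceHolder<String> holder = new SharedInstanceHolder<>("reader-1");
        String old = holder.swap("reader-2");
        System.out.println(old + " replaced by " + holder.get());
    }
}
```

Searches that started before the swap may still be using the old instance, so the caller would need to wait for them before closing it; that bookkeeping is not shown here.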

Re: best way of reusing IndexSearcher objects

2003-12-18 Thread Doug Cutting
Dror Matalon wrote: There are two issues: 1. Having new searches start using the new index only when it's ready, not in a "half baked" state, which means that you have to synchronize the switch from the old index to the new one. That's true. If you're doing updates (as opposed to just additions)

Re: field boosting best practise

2003-12-16 Thread Doug Cutting
If you wish to boost the title field for every query then it would be easiest to boost the title clause of your query, with Query.setBoost(). Field.setBoost() should only be used when you want to give a field different boosts in different documents, but since you want to boost all titles by th

Re: Summarization; sentence-level and document-level filters.

2003-12-16 Thread Doug Cutting
It sounds like you want the value of a stored field (a summary) to be built from the tokens of another field of the same document. Is that right? This is not presently possible without tokenizing the field twice, once to produce its summary and once again when indexing. Doug Gregor Heinrich

Re: Lucene Art

2003-12-12 Thread Doug Cutting
Your setup sounds good to me. Scott Smith wrote: I'm not having a problem. The question is whether I picked a reasonable set of parameters for what I'm doing. I have an application which receives messages. Each message averages around 4k bytes and I get, on average, 0-10 every minute. So my app

Re: Unindexed fields

2003-12-11 Thread Doug Cutting
Stored fields should be able to hold values with up to Integer.MAX_VALUE characters, as should an indexed term. Can you please provide a complete, self-contained test case? Doug Chong, Herb wrote: i had an UnIndexed field which was 300 bytes. i changed it to 10,000 bytes. i also had a Text fie

Re: FSDIrectory.create doesn't tolerate subdirectories

2003-12-08 Thread Doug Cutting
I agree. One should provide Lucene with a unique path in the filesystem, one that is not intended to be used for any other purpose. All access to that path should be through Lucene's API. The fact that Lucene decides to create a directory there rather than a single file is an implementation d

Re: implementing a TokenFilter for aliases

2003-12-05 Thread Doug Cutting
Position increments are for relative token positions. A position increment of zero means that a token is logically at the same position as the previous token. A position increment of one means that a token immediately follows the preceding token in the stream, it's the next token to the right
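To make the relative nature of position increments concrete, here is a small plain-Java sketch that turns a list of increments into absolute positions. The class name is hypothetical, and starting the counter at -1 (so that a first token with increment 1 lands at position 0) is an assumption made for the illustration.

```java
final class PositionIncrements {
    // Converts per-token position increments into absolute positions.
    static int[] absolutePositions(int[] increments) {
        int[] positions = new int[increments.length];
        int position = -1; // assumption: counting starts just before the first token
        for (int i = 0; i < increments.length; i++) {
            position += increments[i];
            positions[i] = position;
        }
        return positions;
    }

    public static void main(String[] args) {
        // "fast" is an alias injected at the same position as "quick" (increment 0).
        String[] tokens = {"quick", "fast", "fox"};
        int[] increments = {1, 0, 1};
        int[] positions = absolutePositions(increments);
        for (int i = 0; i < tokens.length; i++) {
            // prints quick@0, fast@0, fox@1
            System.out.println(tokens[i] + "@" + positions[i]);
        }
    }
}
```

An alias emitted with increment zero shares a position with the original token, so a phrase match that works for one should work for the other.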

Re: Testing for Optimization

2003-12-05 Thread Doug Cutting
jt oob wrote: Can I safely delete those files which do not have the prefix listed in the segments file? Have a look at the index file format documentation: http://jakarta.apache.org/lucene/docs/fileformats.html The only file besides segments that should exist is the "deleteable" file, and the

Re: Index and Field.Text

2003-12-05 Thread Doug Cutting
Tatu Saloranta wrote: Also, shouldn't there be at least 3 methods that take Readers; one for Text-like handling, another for UnStored, and last for UnIndexed. How do you store the contents of a Reader? You'd have to double-buffer it, first reading it into a String to store, and then tokenizing t

Re: NPE when using explain

2003-12-04 Thread Doug Cutting
This looks like a bug. I think your query contains a term in a field that is not indexed, and hence has no norm value. Perhaps (as Brisbart Franck) suggests, it is indexed in some documents, but not in others. But, in a single IndexReader, if it is indexed in any document, it should have a no

Re: New Lucene-powered Website

2003-12-02 Thread Doug Cutting
Otis Gospodnetic wrote: There was discussion about it, yes. I don't think we ever reached any conclusions, and the powered.html still says 'include the logo'. Actually, I think we decided to scrap the requirement, but then we never updated the web site. Here's the message I found: http://nagoya

Re: Dates and others

2003-12-01 Thread Doug Cutting
Dion Almaer wrote: Interesting. I implemented an approach which boosted based on the number of months in the past, and after tweaking the boost amounts, it seems to do the job. I do a fresh reindex every night (since the indexing process takes no time at all... unlike our old search solution!) I

Re: disable locks on read only indexes (performance improvement?)

2003-12-01 Thread Doug Cutting
Kevin A. Burton wrote: Would there be any performance improvement in query throughput and latency if locking were disabled for readonly indexes? The locks are only consulted when opening a new IndexReader. I doubt very much that you're doing this often enough for this to be significant. Doug -

Re: AW: AW: Real Boolean Model in Lucene?

2003-12-01 Thread Doug Cutting
Karsten Konrad wrote: Now hell would be the place for me where I would have to prove that Lucene's ranking is exactly equivalent to some transformation of vector space and then using the *cosine* for the ranking. Can't be really, as Lucene sometimes returns results > 1.0 and only some ruthless no

Re: Dates and others

2003-12-01 Thread Doug Cutting
Dion Almaer wrote: The only real item that I still want to tweak more is getting recent results higher in the list. I was wondering if something like this could work (or if there is a better solution) At index time, I have the date of the content. I could do some math where the higher the date

new release: 1.3 RC3

2003-11-25 Thread Doug Cutting
A new Lucene release is available. It can be downloaded from: http://cvs.apache.org/dist/jakarta/lucene/v1.3-rc3/ Release notes are at: http://cvs.apache.org/viewcvs.cgi/*checkout*/jakarta-lucene/CHANGES.txt?rev=1.58 Enjoy! Doug ---

Re: Fields with same name but different boosts

2003-11-24 Thread Doug Cutting
Andrzej Bialecki wrote: Now, I'm wondering how do I encode the weight of keywords... If I do the following: Field f = Field.Keyword("kw", "value1"); f.setBoost(10.0f); doc.add(f); f = Field.Keyword("kw", "value2"); f.setBoost(20.0f); doc.add(f); Now the question is: what is the boost value for the

Re: Lucene refresh index function (incremental indexing).

2003-11-24 Thread Doug Cutting
Tun Lin wrote: These are the steps I took: 1) I compile all the files in a particular directory using the command: java org.apache.lucene.demo.IndexHTML -create -index c:\\index .. , putting all the indexed files in c:\\index. 2) Every time I add an additional file to that directory, I need to

Re: Which operations change document ids?

2003-11-17 Thread Doug Cutting
ssage----- From: Doug Cutting [mailto:[EMAIL PROTECTED] Sent: 17 November 2003 19:51 To: Lucene Users List Subject: Re: Which operations change document ids? Tate Avery wrote: My first question is: should I steer clear of this all together? No, I think this is appropriate. If not, I need to kn

Re: AW: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-17 Thread Doug Cutting
bably won't substantially alter the ranking. Is 100 long enough? Perhaps not. But 1000 is certainly plenty long. Doug Chong, Herb wrote: any arbitrary number you pick will be broken by some document someone puts into the system. Herb -Original Message----- From: Doug Cutting [mai

Re: AW: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-17 Thread Doug Cutting
Karsten Konrad wrote: I was wondering whether we could, while indexing, make a use of this by increasing the position counter by a large number, let's say 1000, whenever we encounter a sentence separator (Note, this is not trivial; not every '.' ends a sentence etc. etc. etc.). Thus, searching

Re: Which operations change document ids?

2003-11-17 Thread Doug Cutting
Tate Avery wrote: My first question is: should I steer clear of this all together? No, I think this is appropriate. If not, I need to know which Lucene operations can cause document ids to change. I am assuming that the following can cause potential changes: 1) Add document 2) Op

Re: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-14 Thread Doug Cutting
Leo Galambos wrote: There are other (more trivial) problems as well. One geek from UFAL (our NLP lab) reported, that it was a hard problem to find the boundaries, or rather, to say whether a dot is a dot or something else, i.e. "blah, i.e. blah" "i.b.m." "i.p. pavlov" "3.14" "28.10.2003" etc. O

Re: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-14 Thread Doug Cutting
t have typed capital gains tax. there is psychology of query creation too and that is one thing i am taking advantage of. Herb -----Original Message- From: Doug Cutting [mailto:[EMAIL PROTECTED] Sent: Friday, November 14, 2003 3:15 PM To: Lucene Users List Subject: Re: inter-term correlati

Re: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-14 Thread Doug Cutting
Chong, Herb wrote: since i am working now on financial news, here is an example: capital gains tax if i just run this query against a million document newswire index, i know i am going to get lots of hits. the phrase "capital gains tax" hits a lot fewer documents, but is overrestrictive. the fact

Re: Two possible solutions on Parallel Searching

2003-11-13 Thread Doug Cutting
Jie Yang wrote: In this case, probably using a single RAMDirectory would allow me to run parallel searching without worry about disk access. Well, anyone tried to have a RAMDirectory of 5G in size? I don't know of a Java implementation which lets you have a heap larger than 2GB. In my experience,

Re: Query Filters on term A in query "A AND (B OR C OR D)"

2003-11-13 Thread Doug Cutting
Jie Yang wrote: --- Erik Hatcher <[EMAIL PROTECTED]> wrote: Well, not quite, User normally enters a search string A that normally returns 1000 out of 2 millions docs. I then append A with 500 OR conditions... A AND (B or C or ... or x500). Are you adding the same 500 terms to each query? Or even

Re: Query Filters on term A in query "A AND (B OR C OR D)"

2003-11-13 Thread Doug Cutting
Dan Quaroni wrote: name:Bob's Discount Furniture AND state:California AND city:San Diego Now, that query is going to retrieve EVERY Bob's discount furniture, EVERY company in California, and EVERY city in San Diego and then join them. That makes the memory requirements for this query far higher t

Re: Objection to using /tmp for lock files.

2003-11-13 Thread Doug Cutting
Dror Matalon wrote: Is there a reason why RODirectory shouldn't just be rolled into lucene? http://www.csita.unige.it/software/free/lucene/ This just looks like a version of FSDirectory with lock files disabled. I think it would be better to just make it easier to disable lock files. Currently

Re: Two possible solutions on Parallel Searching

2003-11-13 Thread Doug Cutting
William W wrote: If I have two indexes and use the MultiSearcher will it be faster than only one index with all the documents ? No, in fact it would be slower. However it could be faster if (a) someone contributes a parallel version of MultiSearcher and (b) you're either running on a multiple-

Re: Objection to using /tmp for lock files.

2003-11-13 Thread Doug Cutting
Kevin A. Burton wrote: When I first read this changelog entry: > 2. Changed file locking to place lock files in >System.getProperty("java.io.tmpdir"), where all users are >permitted to write files. This way folks can open and correctly >lock indexes which are read-only to them. I

Re: Two possible solutions on Parallel Searching

2003-11-13 Thread Doug Cutting
First, note that the approaches you describe will only improve performance if you have multiple CPUs and/or multiple disks holding the indexes. Second, MultiSearcher is currently implemented to search indexes serially, not each in a separate thread. To implement multi-threaded searching one c

Re: Biggest index size/document in Lucene

2003-11-04 Thread Doug Cutting
There was a bug (recently fixed) when creating indexes with over a couple hundred million documents. So you should use 1.3 RC2, which has a fix for this bug. The biggest indexes I've personally created have around 30M documents. I maintain these as a set of separately updated indexes, then mer

Re: Exact Match

2003-10-22 Thread Doug Cutting
Wilton, Reece wrote: Does Lucene support exact matching on a tokenized field? So for example... if I add these three phrases to the index: - "The quick brown fox" - "The quick brown fox jumped" - "brown fox" I want to be able to do an exact field match so when I search for "brown fox" I only get t

Re: new release: 1.3 RC2

2003-10-22 Thread Doug Cutting
petite_abeille wrote: Quick question regarding release note number 11: What's the difference between IndexWriter.addIndexes(IndexReader[]) and IndexWriter.addIndexes(Directory[]) beside the fact that one takes an array of IndexReader and the other an array of Directory? Any functional differenc

new release: 1.3 RC2

2003-10-22 Thread Doug Cutting
A new Lucene release is available. It can be downloaded from: http://cvs.apache.org/dist/jakarta/lucene/v1.3-rc2/ Release notes are at: http://cvs.apache.org/viewcvs.cgi/*checkout*/jakarta-lucene/CHANGES.txt?rev=1.56 Enjoy! Doug ---

Re: positional token info

2003-10-21 Thread Doug Cutting
Erik Hatcher wrote: Just for fun, I've written a simple stop filter that bumps the position increments to account for the stop words removed: But it's practically impossible to formulate a Query that can take advantage of this. A PhraseQuery, because Terms don't have positional info (only the t

Re: Lucene on Windows

2003-10-21 Thread Doug Cutting
Tate Avery wrote: You might have trouble with "too many open files" if you set your mergeFactor too high. For example, on my Win2k, I can go up to mergeFactor=300 (or so). At 400 I get a too many open files error. Note: the default mergeFactor of 10 should give no trouble. Please note that it is

Re: Too Many Open Files

2003-10-07 Thread Doug Cutting
Wilton, Reece wrote: The index directory that Lucene created has 2,322 files in it. When I try to open it I get the dreaded "Too Many Open Files" problem: java.io.FileNotFoundException: C:\Index\_1lvq.f107 (Too many open files) The index has about 50,000 docs in it. It was created with a merg

Re: Lucene Scoring Behavior

2003-09-17 Thread Doug Cutting
ginal Message - From: "Doug Cutting" <[EMAIL PROTECTED]> To: "Lucene Users List" <[EMAIL PROTECTED]> Sent: Wednesday, September 17, 2003 5:51 PM Subject: Re: Lucene Scoring Behavior Terry Steichen wrote: 0.03125 = fieldNorm(field=pub_date, doc=90992) 1.0 = fi

Re: Lucene Scoring Behavior

2003-09-17 Thread Doug Cutting
Terry Steichen wrote: 0.03125 = fieldNorm(field=pub_date, doc=90992) 1.0 = fieldNorm(field=pub_date, doc=90970) It looks like the fieldNorm's are what differ, not the IDFs. These are the product of the document and/or field boost, and 1/sqrt(numTerms) where numTerms is the number of terms in
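The relationship described above, boost times 1/sqrt(numTerms), can be tried out with a few lines of plain Java. This is only a sketch of the arithmetic, not Lucene's implementation: the class name is hypothetical, numTerms is taken here to be the number of terms in the field (the message is cut off at that point), and the term counts in the example are made up for illustration. A real index may also store a rounded value, since norms are kept in compact form.

```java
final class FieldNormSketch {
    // boost: the combined document and/or field boost
    // numTerms: assumed to be the number of terms in the field
    static float fieldNorm(float boost, int numTerms) {
        return boost * (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        // A single-term field with no boost.
        System.out.println(fieldNorm(1.0f, 1));    // prints 1.0
        // A hypothetical 1024-term field with no boost.
        System.out.println(fieldNorm(1.0f, 1024)); // prints 0.03125
    }
}
```

The two printed values happen to match the fieldNorm values quoted in the explanation output, but the sketch does not establish that term counts were the actual cause in that thread; a boost difference would change the product in the same way.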

Re: Lucene Scoring Behavior

2003-09-17 Thread Doug Cutting
If you're using RangeQuery to do date searching, then you'll likely see unusual scoring. The IDF of a date, like any other term, is inversely related to the number of documents with that date. So documents whose dates are rare will score higher, which is probably not what you intend. Using a
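The effect of date rarity on IDF can be illustrated numerically. The sketch below is plain Java and assumes the formula log(numDocs / (docFreq + 1)) + 1, which is my recollection of Lucene's default Similarity rather than something stated in the message; the class name and the document counts are hypothetical.

```java
final class IdfSketch {
    // Assumed formula: log(numDocs / (docFreq + 1)) + 1
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    public static void main(String[] args) {
        int numDocs = 100_000;
        // A date that appears in only 5 documents versus one that appears in 5000.
        System.out.println("rare date:   " + idf(5, numDocs));
        System.out.println("common date: " + idf(5000, numDocs));
    }
}
```

Under this formula the rare date gets the larger weight, which is why documents with unusual dates can float to the top of a range search even though the date carries no relevance signal.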

Re: slow performance with Date Range Searching

2003-09-17 Thread Doug Cutting
Killeen, Tom wrote: My query would look something like this: LongTitle:killeen AND LongTitle:state AND StateDistrict:id AND FiledDate:["1997-01-01" TO "2002-04-04"] and it returned in 5.7 seconds Does anyone have any suggestions for searching date ranges. Our ranges will generally be between a 3

Re: endOffset, startOffset of Token

2003-09-12 Thread Doug Cutting
[EMAIL PROTECTED] wrote: Are the endOffset, startOffset fields of a Token used in proximity search and phrase search? No. They are not used by indexing or search. Their intent is only to aid the extraction of matching text snippets when displaying results. Doug

Re: Split results based on the value of a field

2003-09-11 Thread Doug Cutting
Jon Pither wrote: I have a requirement whereby I'd like to pull search results back and split them up based on some keyword field. So for example, say there's a field named 'category', I'd like to be able to have the results displayed as such: Search Results for Category A: 1, 2, 3, Search

Re: Lucene features

2003-09-11 Thread Doug Cutting
Leo Galambos wrote: Example: I use this notation: inverted_list_term:{list of W values, "-" denotes W=0, for 12 documents in a collection} A:{23[16]--27} B:{--[38]} C:{18[2-]45239812} If your first query is B, the subset of documents (denoted by brackets - namely, the 3rd and 4th doc)

Re: Lucene features

2003-09-11 Thread Doug Cutting
Erik Hatcher wrote: Yes, you're right. Getting the scores of a second query based on the scores of the first query is probably not trivial, but probably possible with Lucene. And that combined with a QueryFilter would do the trick I suspect. Somehow the scores of the first query could be reme

Re: Lucene documentation

2003-09-11 Thread Doug Cutting
Terry Steichen wrote: PS: If there is general interest in doing some documentation enhancement, I'd be happy to participate/contribute. I think there's always room for better documentation. If you have ideas about this, and, more importantly, time to contribute, please have a go at it. Doug --

Re: Thread safety of QueryParser

2003-08-26 Thread Doug Cutting
Luke Francl wrote: According to the jGuru FAQ, QueryParser is not thread safe: http://www.jguru.com/faq/view.jsp?EID=492389 However, this information is several years old. Is this still true? The answer to the question suggests using a new parser for every thread, but the QueryParser.parse(Strin

Re: Fastest batch indexing with 1.3-rc1

2003-08-20 Thread Doug Cutting
Leo Galambos wrote: Isn't it better for Dan to skip the optimization phase before merging? I am not sure, but he could save some time on this (if he has enough file handles for that, of course). It depends. If you have 10 machines, each with a single disk, that you use for indexing in parallel,

Re: Will failed optimize corrupt an index?

2003-08-20 Thread Doug Cutting
The index should be fine. Lucene index updates are atomic. Doug Dan Quaroni wrote: My index grew about 7 gigs larger than I projected it would, and it ran out of disk space during optimize. Does lucene have transactions or anything that would prevent this from corrupting an index, or do I need

Re: Fastest batch indexing with 1.3-rc1

2003-08-20 Thread Doug Cutting
As the index grows, disk i/o becomes the bottleneck. The default indexing parameters do a pretty good job of optimizing this. But if you have lots of CPUs and lots of disks, you might try building several indexes in parallel, each containing a subset of the documents, optimize each index and

Re: Lucene Index on NFS Server

2003-08-20 Thread Doug Cutting
ter.close() or IndexReader.close() for deletions) then this will not be a problem. Doug Morus Walter wrote: Doug Cutting writes: Can I have a lucene index on a NFS filesystem without problems (access is readonly)? So long as all access is read-only, there should not be a problem. Keep in mind however that

Re: Searching while optimizing

2003-08-20 Thread Doug Cutting
is not thread safe in this regard. Here is a quote from Doug Cutting, the creator of Lucene: The problems are only when you add documents or optimize an index, and then search with an IndexReader that was constructed before those changes to the index were made. A possible work around is to perform the

Re: How do you pronounce 'Lucene'?

2003-08-14 Thread Doug Cutting
Loo-seen. Danny Sofer wrote: ...and where does the name come from? It's my wife's middle name, and her maternal grandmother's first name. Doug

Re: Searching while optimizing

2003-07-31 Thread Doug Cutting
Aviran Mordo wrote: Is it possible and safe to search an index while another thread adds documents or optimizes the same index? Yes.

Re: Lucene Index on NFS Server

2003-07-31 Thread Doug Cutting
Morus Walter wrote: Can I have a lucene index on a NFS filesystem without problems (access is readonly)? So long as all access is read-only, there should not be a problem. Keep in mind however that lock files are known to not work correctly over NFS. Doug ---

Re: Indexing very large sets (10 million docs)

2003-07-30 Thread Doug Cutting
Roger Ford wrote: I do have another problem: running multi-user tests - four "users" all firing off queries one after the other - I hit this exception at the start of one run: caught a class java.io.IOException with message: Timed out waiting for [EMAIL PROTECTED]:\Lucene_Index\Index0001\comm

Re: Safe to write while optimizing?

2003-07-29 Thread Doug Cutting
Wilton, Reece wrote: Three questions: - Is it safe to have two IndexWriters open on the same index? No. It is not safe, and the code makes every attempt to prohibit it. - Is it safe to have two IndexWriters adding a document concurrently? No, but you can have two threads adding documents to a sin

Re: Indexing very large sets (10 million docs)

2003-07-28 Thread Doug Cutting
Ryan Clifton wrote: You seem to be implying that it is possible to optimize very large indexes. My index has a couple million records, but more importantly it's about 40 gigs in size. I have tried many times to optimize it and this always results in hitting the Linux file size limit. Is there a

Re: Indexing very large sets (10 million docs)

2003-07-28 Thread Doug Cutting
Armbrust, Daniel C. wrote: If you set your mergeFactor back down to something closer to the default (10) - you probably wouldn't have any problems with file handles. The higher you make it, the more open files you will have. When I set it at 90 for performance reasons, I would run out of file han

Re: parallizing index building

2003-06-30 Thread Doug Cutting
Marc Dumontier wrote: I'm indexing 500 XML files each ~150Mb on an 8 CPU machine. I'm wondering what the best strategy for making maximum use of resources is. I have tweaked the single process indexer to index 5000 records (not files) in memory before writing out to disk. Should I create an I

Re: Geting exact term positions for each document inside a collect method...

2003-06-30 Thread Doug Cutting
Jim Hargrave wrote: I've defined my own collector (I want the raw score before it is normalized between 1.0 and 0.0). For each document I need to know the matching term positions in the document. I've seen the methods in IndexReader, but how can I access them inside my collect method? Are ther

Re: HitCollector not serializable (Bug?)

2003-06-16 Thread Doug Cutting
The HitCollector-based search API is not meant to work remotely. To do so would involve an RPC-callback for every non-zero score, which would be extremely expensive. Also, just making HitCollector serializable would not be sufficient. You'd also need to pass in a HitCollector implementation

Re: How to get field contents

2003-06-13 Thread Doug Cutting
This can be done more efficiently if you only want to enumerate the terms of a particular field. Term enumerations are ordered first by field, then by the term text. You can also specify the initial position of a term enumeration. Thus an efficient enumeration of the terms in "myField" can b
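The ordering property mentioned above (terms sorted first by field, then by text) is what makes a per-field enumeration cheap: seek to the first possible term of the field and stop as soon as the field changes. The sketch below models this with a sorted set in plain Java. It is not Lucene's TermEnum API; the Term class here is a hypothetical stand-in and the sample data is invented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

final class FieldTermEnumeration {
    // Hypothetical stand-in for an index term: ordered by field, then by text.
    static final class Term implements Comparable<Term> {
        final String field;
        final String text;

        Term(String field, String text) {
            this.field = field;
            this.text = text;
        }

        @Override
        public int compareTo(Term other) {
            int byField = field.compareTo(other.field);
            return byField != 0 ? byField : text.compareTo(other.text);
        }
    }

    // Seeks to the smallest possible term of the field, then reads until the field changes.
    static List<String> termsOf(TreeSet<Term> allTerms, String field) {
        List<String> result = new ArrayList<>();
        for (Term term : allTerms.tailSet(new Term(field, ""))) {
            if (!term.field.equals(field)) {
                break; // past the last term of the requested field
            }
            result.add(term.text);
        }
        return result;
    }

    public static void main(String[] args) {
        TreeSet<Term> terms = new TreeSet<>();
        terms.add(new Term("title", "lucene"));
        terms.add(new Term("myField", "zebra"));
        terms.add(new Term("author", "doug"));
        terms.add(new Term("myField", "apple"));
        System.out.println(termsOf(terms, "myField")); // prints [apple, zebra]
    }
}
```

Terms of fields that sort before the requested one are never visited, and iteration ends at the first term of the next field, so the cost depends only on the number of terms in that one field.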

Re: Storing binary data

2003-06-12 Thread Doug Cutting
Eric Jain wrote: Has anyone ever considered storing binary data into an index? In particular, serialized objects? This would seem to be a natural solution in certain situations, and avoids many problems that arise when using a separate object store (e.g. Jisp): inconsistencies while updating, and a

Re: OutOfMemoryErrors searching with WildCardQueries

2003-06-12 Thread Doug Cutting
Konrad Kolosowski wrote: If the index grows to a hundred thousand documents, with users simultaneously searching indexes for different locales, what is the best way to cap the memory requirement? Limiting number of terms, or number of terms containing wild cards, or eliminating wild card searches al

Re: Where to get stopword lists?

2003-06-06 Thread Doug Cutting
Ulrich Mayring wrote: does anyone know of good stopword lists for use with Lucene? I'm interested in English and German lists. The Snowball project has good stop lists. See: http://snowball.tartarus.org/ http://snowball.tartarus.org/english/stop.txt http://snowball.tartarus.org/german/stop

Re: search item with '-' in it

2003-06-05 Thread Doug Cutting
Lixin Meng wrote: Therefore, it would be preferable to treat all hyphen in the same way. Either as a delimiter or as part of the word (maybe with a flag at the API). If we change StandardTokenizer in this way then we risk breaking all the applications that currently use it and depend on its curren

Re: search item with '-' in it

2003-06-05 Thread Doug Cutting
You should look at the output of your analyzer. Just write a simple test program, something like: public static void main(String[] args) throws Exception { System.out.println("Tokenizing " + args[0]); Analyzer analyzer = new MyAnalyzer(...); TokenStream ts = analyzer.tokenStream(ne

Re: java.lang.IllegalArgumentException: attempt to access a deleted document

2003-06-05 Thread Doug Cutting
Rob Outar wrote: public synchronized String[] getDocuments() throws IOException { IndexReader reader = null; try { reader = IndexReader.open(this.indexLocation); int numOfDocs = reader.numDocs(); String[] docs = new String[numOfDocs];

Re: Changing Field type

2003-03-27 Thread Doug Cutting
Maik Schreiber wrote: In an index I have documents with a field that has been constructed using Field.UnIndexed(). Now I want to switch to Field.Keyword() so I can search for those fields, too. Does it cause any harm if I'm mixing field types like that? I think this used to throw an exception, bu

new Lucene release: 1.3 RC1

2003-03-24 Thread Doug Cutting
There's a new Lucene release available for download. See the website for details: http://jakarta.apache.org/lucene/docs/index.html Doug

Re: multiple collections indexing

2003-03-19 Thread Doug Cutting
Morus Walter wrote: Searches must be able on any combination of collections. A typical search includes ~ 40 collections. Now the question is, how to implement this in lucene best. Currently I see basically three possibilities: - create a data field containing the collection name for each document

Re: Range of Score Values?

2003-03-14 Thread Doug Cutting
Rishabh Bajpai wrote: I am getting a long value between 1 (included) and 0 (excluded, I think), and it makes sense to me logically as well - I wouldn't know what a value of greater than 1 would mean, and why should a term that has a score of 0 be returned in the first place! But just to be sure, I want

Re: Need help in changing the search score

2003-03-11 Thread Doug Cutting
Ching-Pei Hsing wrote: Even if we boost the Name by 10 like the following query, It's still the same. query = (NAME:inn NAME:comfort NAME:shampoo)^10 (MMNUM:inn MMNUM:shampoo MMNUM:comfort) (SMNUM:shampoo SMNUM:comfort SMNUM:inn) In the 1.2 release, I don't think this sort of boosting (of a compl

Re: IndexReader.delete(int) not working for me

2003-03-05 Thread Doug Cutting
Joseph Ottinger wrote: Then this means that my IndexReader.delete(i) isn't working properly. What would be the common causes for this? My log shows the documents being deleted, so something's going wrong at that point. Are you closing the IndexReader after doing the deletes? This is required for

Re: IndexReader.delete(int) not working for me

2003-03-05 Thread Doug Cutting
Joseph Ottinger wrote: I've got a versioning content system where I want to replace documents in a lucene repository. To do so, according to the FAQ and the mailing list archives, I need to open an IndexReader, look for the document in question, delete it via the IndexReader, and then add it. This

Re: Computing Relevancy Differently

2003-02-28 Thread Doug Cutting
'll start experimenting shortly. Regards, Terry - Original Message - From: "Doug Cutting" <[EMAIL PROTECTED]> To: "Lucene Users List" <[EMAIL PROTECTED]> Sent: Monday, February 10, 2003 1:57 PM Subject: Re: Computing Relevancy Differently Terry Steichen

Re: Indexing Tips and Hints

2003-02-25 Thread Doug Cutting
These sort of tricks can help things some if index i/o is really your bottleneck. Are you convinced that it is? When i/o is a bottleneck the CPU typically spends a large portion of its time idle. Do you see this? From your description (indexing ~300k 5k documents takes over 24 hours) I would

Re: Indexing Tips and Hints

2003-02-25 Thread Doug Cutting
I doubt this will make Lucene much faster, since Lucene already implements buffering in its InputStream and OutputStream classes. So Lucene already has this optimization built-in. Doug Andrzej Bialecki wrote: Hello, Since you are trying this anyway, and looking for ways to improve indexing t

Re: Score per Term

2003-02-24 Thread Doug Cutting
Andrzej Bialecki wrote: Do you think it would be possible/feasible to modify the searching classes so that they create Explanations at the same time I'm running the query? That's not feasible because it would slow down query execution too much. Doug

Re: Score per Term

2003-02-24 Thread Doug Cutting
Check out the new Explanation API in the latest CVS sources. It permits one to get a detailed explanation of how a query was scored against a document. Note that these explanations are designed for user perusal, not for further computation, and are as expensive to construct as re-running the

don't crosspost to lucene-user and lucene-dev

2003-02-20 Thread Doug Cutting
Please send Lucene-related messages to just one of lucene-user or lucene-dev, *not* both. The lucene-dev list should be considered a subset of the lucene-user list that is concerned with the development of lucene. Things that should be sent to this list are: . reproducible bug reports, compl

Re: Phrase query and porter stemmer

2003-02-13 Thread Doug Cutting
Mailing Lists Account wrote: Doug Cutting wrote: That's because Google and most internet search engines never do any stemming. Generally speaking, are there any advantages not to apply the stemmer ? Except for certain keywords,I found use of stemmers helpful. Generally speaking, ste

Re: Phrase query and porter stemmer

2003-02-12 Thread Doug Cutting
Mailing Lists Account wrote: I use PorterStemmer with my analyzer for indexing the documents. And I have been using the same analyzer for searching too. When I search for a phrase like "security" AND database, I would like to avoid matches for terms like "secure" or "securities" . I observed tha

Re: multi-station netspread indexing

2003-02-12 Thread Doug Cutting
The RemoteSearchable class (in the latest CVS) will let you do this. It uses Java RMI to let you search indexes on other machines. With a MultiSearcher you can then search a number of independently maintained indexes on different machines. MultiSearcher searches indexes serially, but it woul
