Re: Fast access to a random page of the search results.

2005-03-01 Thread Doug Cutting
Stanislav Jordanov wrote:
startTs = System.currentTimeMillis();
dummyMethod(hits.doc(nHits - nHits));
stopTs = System.currentTimeMillis();
System.out.println("Last doc accessed in " + (stopTs - startTs) + " ms");
'nHits - nHits' always equals zero.  So you're actually accessing the 
first document, not the last.  The last document would be accessed with 
'hits.doc(nHits - 1)', assuming nHits is hits.length().  Accessing the 
last document should not be much slower (or faster) than accessing the 
first.

200+ milliseconds to access a document does seem slow.  Where is your 
index stored?  On a local hard drive?

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Fast access to a random page of the search results.

2005-03-01 Thread Doug Cutting
Daniel Naber wrote:
After fixing this I can reproduce the problem with a local index that 
contains about 220.000 documents (700MB). Fetching the first document 
takes for example 30ms, fetching the last one takes 100ms. Of course I 
tested this with a query that returns many results (about 50.000). 
Actually it happens even with the default sorting, no need to sort by some 
specific field.
In part this is due to the fact that Hits first searches for the 
top-scoring 100 documents.  Then, if you ask for a hit after that, it 
must re-query.  In part this is also due to the fact that maintaining a 
queue of the top 50k hits is more expensive than maintaining a queue of 
the top 100 hits, so the second query is slower.  And in part this could 
be caused by other things, such as that the highest ranking document 
might tend to be cached and not require disk io.

One could perform profiling to determine which is the largest factor. 
Of these, only the first is really fixable: if you know you'll need hit 
50k then you could tell this to Hits and have it perform only a single 
query.  But the algorithmic cost of keeping the queue of the top 50k is 
the same as collecting all the hits and sorting them.  So, in part, 
getting hits 49,990 through 50,000 is inherently slower than getting 
hits 0-10.  We can minimize that, but not eliminate it.
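
The queue argument above can be illustrated with a small plain-Java sketch. 
It is not Lucene's Hits or its internal hit queue; TopHitsPager and page() 
are hypothetical names, and the scores are just floats.  The point is that 
returning the page that starts at rank 'offset' requires a queue of size 
offset + pageSize, so a deep page costs more than the first page.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical helper, not part of Lucene: pages through scores with a bounded min-heap.
final class TopHitsPager {

    /** Returns the scores ranked [offset, offset + pageSize) in descending order. */
    static List<Float> page(float[] scores, int offset, int pageSize) {
        int queueSize = offset + pageSize;      // deep pages need a bigger queue
        if (queueSize <= 0) {
            return new ArrayList<>();
        }
        PriorityQueue<Float> queue = new PriorityQueue<>();  // smallest score on top
        for (float score : scores) {
            if (queue.size() < queueSize) {
                queue.add(score);
            } else if (score > queue.peek()) {
                queue.poll();                   // evict the current weakest hit
                queue.add(score);
            }
        }
        List<Float> ranked = new ArrayList<>(queue);
        Collections.sort(ranked, Collections.reverseOrder());
        int from = Math.min(offset, ranked.size());
        int to = Math.min(offset + pageSize, ranked.size());
        return new ArrayList<>(ranked.subList(from, to));
    }
}
```

With 50,000 matches, page(scores, 0, 10) keeps a 10-entry heap, while 
page(scores, 49990, 10) keeps all 50,000 entries, which is the inherent 
extra cost described above.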

Doug


Re: Best Practices for Distributing Lucene Indexing and Searching

2005-03-01 Thread Doug Cutting
Yonik Seeley wrote:
6. Index locally and synchronize changes periodically. This is an
interesting idea and bears looking into. Lucene can combine multiple
indexes into a single one, which can be written out somewhere else, and
then distributed back to the search nodes to replace their existing
index.
This is a promising idea for handling a high update volume because it
avoids all of the search nodes having to do the analysis phase.
A clever way to do this is to take advantage of Lucene's index file 
structure.  Indexes are directories of files.  As the index changes 
through additions and deletions most files in the index stay the same. 
So you can efficiently synchronize multiple copies of an index by only 
copying the files that change.

The way I did this for Technorati was to:
1. On the index master, periodically checkpoint the index.  Every minute 
or so the IndexWriter is closed and a 'cp -lr index index.DATE' command 
is executed from Java, where DATE is the current date and time.  This 
efficiently makes a copy of the index when it's in a consistent state by 
constructing a tree of hard links.  If Lucene re-writes any files (e.g., 
the segments file) a new inode is created and the copy is unchanged.

2. From a crontab on each search slave, periodically poll for new 
checkpoints.  When a new index.DATE is found, use 'cp -lr index 
index.DATE' to prepare a copy, then use 'rsync -W --delete 
master:index.DATE index.DATE' to get the incremental index changes. 
Then atomically install the updated index with a symbolic link (ln -fsn 
index.DATE index).

3. In Java on the slave, re-open 'index' when its version changes. 
This is best done in a separate thread that periodically checks the 
index version.  When it changes, the new version is opened and a few 
typical queries are performed on it to pre-load Lucene's caches.  Then, 
in a synchronized block, the Searcher variable used in production is 
updated.

4. In a crontab on the master, periodically remove the oldest checkpoint 
indexes.

Technorati's Lucene index is updated this way every minute.  A 
mergeFactor of 2 is used on the master in order to minimize the number 
of segments in production.  The master has a hot spare.
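
Step 3 above (open the new version, warm it, then swap it in under a lock) 
can be sketched in plain Java.  This is only an illustration of the 
pattern, not Technorati's code: ReloadingHolder, versionSource and opener 
are hypothetical names, and the opener stands in for "open the index and 
run a few typical queries".

```java
import java.util.function.LongFunction;
import java.util.function.LongSupplier;

// Hypothetical holder for the searcher used in production.
final class ReloadingHolder<S> {
    private final LongSupplier versionSource;  // assumed: reports the current index version
    private final LongFunction<S> opener;      // assumed: opens and warms a searcher
    private long currentVersion;
    private S current;

    ReloadingHolder(LongSupplier versionSource, LongFunction<S> opener) {
        this.versionSource = versionSource;
        this.opener = opener;
        this.currentVersion = versionSource.getAsLong();
        this.current = opener.apply(currentVersion);
    }

    /** Called by query threads to get the searcher currently in production. */
    synchronized S get() {
        return current;
    }

    /** Called periodically from a background thread; returns true if a swap happened. */
    boolean refreshIfChanged() {
        long latest = versionSource.getAsLong();
        synchronized (this) {
            if (latest == currentVersion) {
                return false;
            }
        }
        S fresh = opener.apply(latest);        // slow part (open + warm) runs outside the lock
        synchronized (this) {
            current = fresh;                   // only the reference swap is synchronized
            currentVersion = latest;
        }
        return true;
    }
}
```

A background thread (for example one scheduled with a 
ScheduledExecutorService) would call refreshIfChanged() every few seconds. 
Closing the old searcher once in-flight queries finish is left out of this 
sketch.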

Doug


Re: 1.4.x TermInfosWriter.indexInterval not public static ?

2005-02-28 Thread Doug Cutting
Chris Hostetter wrote:
 1) If making it mutable requires changes to other classes to propagate
it, then why is it now an instance variable instead of a static?
(Presumably making it an instance variable allows subclasses to
override the value, but if other classes have internal expectations
of the value, that doesn't seem safe)
It's an instance variable because it can vary from instance-to-instance. 
 This value is specified when an index segment is written, and 
subsequently read from disk and used when reading that segment.  It's an 
instance variable in both the writing and reading code.  The thing 
that's lacking is a way to pass in alternate values to the writing code.

The reason that other classes are involved is that the reading and 
writing code are in non-public classes.  We don't want to expose the 
implementation too much by making these public, but would rather expose 
these as getter/setter methods on the relevant public API.

 2) Should it be configurable through a get/set method, or through a
system property?
(which rehashes the instance/global question)
That's indeed the question.  My guess is that a system property would 
probably be sufficient for most, but perhaps not for all.  Similarly 
with a static setter/getter.  But a getter/setter on IndexWriter would 
make everyone happy.

 3) Is it important that a writer updating an existing index use the same
value as the writer that initially created the index?  If so, should
there really be a preferedIndexInterval variable which is mutable,
and a currentIndexInterval which is set to the value of the index
currently being updated.  Such that preferedIndexInterval is used when
making an index from scratch and currentIndexInterval is used when
adding segments to a new index?
It's used whenever an index segment is created.  Index segments are 
created when documents are added and when index segments are merged to 
form larger index segments.  Merging happens frequently while indexing. 
 Optimization merges all segments.

The value can vary in each segment.
The default value is probably good for all but folks with very large 
indexes, who may wish to increase the default somewhat.  Also folks with 
smaller indexes and very high query volumes may wish to decrease the 
default.  It's a classic time/memory tradeoff.  Higher values use less 
memory and make searches a bit slower, smaller values use more memory 
and make searches a bit faster.

Unless there are objections I will add this as:
  IndexWriter.setTermIndexInterval()
  IndexWriter.getTermIndexInterval()
Both will be marked Expert.
Further discussion should move to the lucene-dev list.
Doug


Re: 1.4.x TermInfosWriter.indexInterval not public static ?

2005-02-25 Thread Doug Cutting
Kevin A. Burton wrote:
What's the desired pattern of use of TermInfosWriter.indexInterval?
There isn't one.  It is not a part of the public API.  It is an 
unsupported internal feature.

Do I have to compile my own version of Lucene to change this?
Yes.
The last API was public static final, but this is neither public nor static.
It was never public.  It used to be static and final, but is now an 
instance variable.

I'm wondering if we should just make this a value that can be set at 
runtime.  Considering the memory savings for larger installs this 
can/will be important.
The place to put getter/setters would be IndexWriter, since that's the 
public home of all other index parameters.  Some changes to 
DocumentWriter and SegmentMerger would be required to pass this value 
through to TermInfosWriter from IndexWriter.

Doug


Re: Possible to mix/match indexes with diff TermInfosWriter.INDEX_INTERVAL ??

2005-02-24 Thread Doug Cutting
Kevin A. Burton wrote:
I finally had some time to take Doug's advice and reburn our indexes 
with a larger TermInfosWriter.INDEX_INTERVAL value.
It looks like you're using a pre-1.4 version of Lucene.  Since 1.4 this 
is no longer called TermInfosWriter.INDEX_INTERVAL, but rather 
TermInfosWriter.indexInterval.

Is this setting incompatible with older indexes burned with the lower 
value?
Prior to 1.4, yes.  After 1.4, no.
Doug


Re: Possible to mix/match indexes with diff TermInfosWriter.INDEX_INTERVAL ??

2005-02-24 Thread Doug Cutting
Kevin A. Burton wrote:
Is this setting incompatible with older indexes burned with the lower 
value?
Prior to 1.4, yes.  After 1.4, no.
What happens after 1.4?  Can I take indexes burned with 256 (a greater 
value) in 1.3 and open them up correctly with 1.4?
Not without hacking things.  If your 1.3 indexes were generated with 256 
then you can modify your version of Lucene 1.4+ to use 256 instead of 
128 when reading a Lucene 1.3 format index (SegmentTermEnum.java:54 today).

Prior to 1.4 this was a constant, hardwired into the index format.  In 
1.4 and later each index segment stores this value as a parameter.  So 
once 1.4 has re-written your index you'll no longer need a modified version.

Doug


Re: Javadoc error?

2005-02-23 Thread Doug Cutting
Mark Woon wrote:
The javadoc for Field.setBoost() claims:
The boost is multiplied by |Document.getBoost()| 
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/document/Document.html#getBoost%28%29 
of the document containing this field. If a document has multiple fields 
with the same name, all such values are multiplied together.

However, from what I can tell from IndexSearcher.explain(), multiple 
fields with the same name have their boost values added together.  It 
might very well be that I'm misinterpreting what I'm seeing from 
explain(), but if I'm not, then either the javadoc is wrong or there's a 
bug somewhere...

Does anyone know which way it's actually supposed to work?
Boosts for multiple fields with the same name in a document are 
multiplied together at index time to form the boost for that field of 
that document.  At search time, if multiple query terms from the same 
field match the same document, then that document's field boost is 
multiplied into the score for both terms, and these scores are then 
added.  If boost(field,doc) is the boost, and raw(term,doc) is the raw, 
unboosted score (I'm simplifying things) then the score for a two term 
query is something like:

  boosted(t1,t2,d) =
    boost(t1.field,d)*raw(t1,d) + boost(t2.field,d)*raw(t2,d)

which, when t1 and t2 are in the same field, is equivalent to:

  boosted(t1,t2,d) = boost(field,d)*(raw(t1,d) + raw(t2,d))

The explain() feature prints things in the first form, where the boosts 
appear in separate components of a sum.
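
As a quick numeric check of the two forms, here is a plain-Java sketch. 
It does not call Lucene; BoostedScore and the sample values are made up 
for illustration, with the field boost formed by multiplying two 
same-named field boosts as described above.

```java
// Hypothetical arithmetic check of the simplified scoring forms; not Lucene code.
final class BoostedScore {

    /** First form: each term's raw score is boosted separately, then summed. */
    static float perTermForm(float boost1, float raw1, float boost2, float raw2) {
        return boost1 * raw1 + boost2 * raw2;
    }

    /** Second form: valid when both terms are in the same field. */
    static float factoredForm(float fieldBoost, float raw1, float raw2) {
        return fieldBoost * (raw1 + raw2);
    }

    public static void main(String[] args) {
        // Two fields with the same name and boosts 2.0 and 1.5 multiply at index time.
        float fieldBoost = 2.0f * 1.5f;
        float raw1 = 0.25f;
        float raw2 = 0.5f;
        System.out.println(perTermForm(fieldBoost, raw1, fieldBoost, raw2));
        System.out.println(factoredForm(fieldBoost, raw1, raw2));
    }
}
```

Both lines print the same value, which is why a sum in explain() output 
does not mean the index-time boosts were added.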

Does that help?
Doug


Re: Iterate through all the document ids in the index?

2005-02-21 Thread Doug Cutting
William Lee wrote:
is there a simple and
fast way to get a list of document IDs through the lucene index?  

I can use a loop to iterate from 0 to IndexReader.maxDoc and
check whether the document id is valid through
IndexReader.document(i), but this would imply that I have to
retrieve the documents fields.
Use IndexReader.isDeleted() to check if each id is valid.  This is quite 
fast.
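
A minimal sketch of that loop follows.  To keep it self-contained it uses 
a small hypothetical interface, DocIdSource, that mirrors only the two 
IndexReader calls involved (maxDoc() and isDeleted(int)); with Lucene on 
the classpath you would call those methods on the IndexReader directly.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; DocIdSource stands in for org.apache.lucene.index.IndexReader.
final class LiveDocIds {

    /** Mirrors the two IndexReader methods used below. */
    interface DocIdSource {
        int maxDoc();
        boolean isDeleted(int docId);
    }

    /** Collects every id in [0, maxDoc) that is not marked deleted. */
    static List<Integer> liveIds(DocIdSource reader) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < reader.maxDoc(); i++) {
            if (!reader.isDeleted(i)) {   // no stored fields are loaded here
                ids.add(i);
            }
        }
        return ids;
    }
}
```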

Doug


Re: Opening up one large index takes 940M or memory?

2005-02-15 Thread Doug Cutting
Kevin A. Burton wrote:
1.  Do I have to do this with a NEW directory?  Our nightly index merger 
uses an existing target index which I assume will re-use the same 
settings as before?  I did this last night and it still seems to use the 
same amount of memory.  Above you assert that I should use a new empty 
directory and I'll try that tonight.
You need to re-write the entire index using a modified 
TermInfosWriter.java.  Optimize rewrites the entire index but is 
destructive.  Merging into a new empty directory is a non-destructive 
way to do this.

2. This isn't destructive is it?  I mean I'll be able to move BACK to a 
TermInfosWriter.indexInterval of 128 right?
Yes, you can go back if you re-optimize or re-merge again.
Also, there's no need to CC my personal email address.
Doug


Re: new segment for each document

2005-02-10 Thread Doug Cutting
Daniel Naber wrote:
On Thursday 10 February 2005 22:27, Ravi wrote:
I tried setting the minMergeFactor on the writer to one. But
it did not work.
I think there's an off-by-one bug so two is the smallest value that works 
as expected.
You can simply create a new IndexWriter for each add and then close it. 
 IndexWriter is pretty lightweight, so this shouldn't have too much 
overhead.

Doug


Re: Reconstruct segments file?

2005-02-07 Thread Doug Cutting
Ian Soboroff wrote:
Speaking of Counter, I have a dumb question.  If the segments are
named using an integer counter which is incremented, what is the point
in converting that counter into a string for the segment filename?
Why not just name the segments e.g. 1.frq, etc.?
The names are prefixed with an underscore, since it turns out that some 
filesystems (DOS?) have trouble with certain all-digit names.  Other 
than that, they are integers, just with a large radix.
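
For illustration, a name of this shape can be built and parsed in plain 
Java as below.  The radix is an assumption: the message only says "a large 
radix", and this sketch uses Character.MAX_RADIX (36).  SegmentNames is a 
hypothetical helper, not a Lucene class.

```java
// Hypothetical helper; assumes radix 36, which the message does not state.
final class SegmentNames {

    /** Encodes a counter as an underscore-prefixed name, e.g. 36 -> "_10". */
    static String nameFor(int counter) {
        return "_" + Integer.toString(counter, Character.MAX_RADIX);
    }

    /** Decodes a name produced by nameFor back into its counter. */
    static int counterFor(String name) {
        return Integer.parseInt(name.substring(1), Character.MAX_RADIX);
    }
}
```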

Doug


Re: Reconstruct segments file?

2005-02-04 Thread Doug Cutting
Ian Soboroff wrote:
I've looked over the file formats web page, and poked at a known-good
segments file from a separate, similar index using od(1) and such.  I
guess what I'm not sure how to do is to recover the SegSize from the
segment I have.
The SegSize should be the same as the length in bytes of any of the 
.f[0-9]+ files in the segment.  If your segment is in compound format 
then you can use IndexReader.main() in the current SVN version to list 
the files and sizes in the .cfs file, including its contained .f[0-9]+ 
files.
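
For a segment that is not in compound format, the check can be done with 
plain java.io, as in this sketch.  NormFileSizes and its parameters are 
hypothetical names; the code only lists files whose names match 
<segment>.f<digits> and reports their lengths in bytes.

```java
import java.io.File;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

// Hypothetical helper: lists the sizes of a segment's .f[0-9]+ files.
final class NormFileSizes {

    /** Maps each matching file name to its length in bytes. */
    static Map<String, Long> sizes(File indexDir, String segment) {
        Pattern pattern = Pattern.compile(Pattern.quote(segment) + "\\.f[0-9]+");
        Map<String, Long> result = new TreeMap<>();
        File[] files = indexDir.listFiles();
        if (files == null) {              // not a directory, or an I/O error
            return result;
        }
        for (File file : files) {
            if (file.isFile() && pattern.matcher(file.getName()).matches()) {
                result.put(file.getName(), file.length());
            }
        }
        return result;
    }
}
```

If the reported lengths all agree, that common value is the candidate 
SegSize; if they differ, the segment was probably only partially written.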

Doug


Re: Disk space used by optimize

2005-01-31 Thread Doug Cutting
Yura Smolsky wrote:
There is a big difference when you use compound index format or
multiple files. I have tested it on the big index (45 Gb). When I used
compound file then optimize takes 3 times more space, b/c *.cfs needs
to be unpacked.
Now I do use non compound file format. It needs like twice as much
disk space.
Perhaps we should add something to the javadocs noting this?
Doug


Re: Opening up one large index takes 940M or memory?

2005-01-27 Thread Doug Cutting
Kevin A. Burton wrote:
Is there any way to reduce this footprint?  The index is fully 
optimized... I'm willing to take a performance hit if necessary.  Is 
this documented anywhere?
You can increase TermInfosWriter.indexInterval.  You'll need to re-write 
the .tii file for this to take effect.  The simplest way to do this is 
to use IndexWriter.addIndexes(), adding your index to a new, empty, 
directory.  This will of course take a while for a 60GB index...

Doubling TermInfosWriter.indexInterval should halve the Term memory usage 
and double the time required to look up terms in the dictionary.  With 
an index this large the latter is probably not an issue, since 
processing term frequency and proximity data probably overwhelmingly 
dominates search performance.
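
The tradeoff can be put in rough numbers with a simplified model, assumed 
here for illustration: one of every indexInterval terms is held in memory, 
and a lookup scans sequentially from the nearest in-memory term. 
TermIndexEstimate is a hypothetical name, and the figures are term counts, 
not bytes.

```java
// Hypothetical back-of-the-envelope model of the term index; not Lucene code.
final class TermIndexEstimate {

    /** Number of terms held in memory when every indexInterval-th term is indexed. */
    static long indexedTerms(long totalTerms, int indexInterval) {
        return (totalTerms + indexInterval - 1) / indexInterval;  // ceiling division
    }

    /** Worst-case number of terms scanned on disk after the in-memory lookup. */
    static int maxSequentialScan(int indexInterval) {
        return indexInterval - 1;
    }
}
```

Under this model, 1,024,000 distinct terms need 8,000 in-memory entries at 
an interval of 128 and 4,000 at 256, while the worst-case scan grows from 
127 to 255 terms.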

Perhaps we should make this public by adding an IndexWriter method?
Also, you can list the size of your .tii file by using the main() from 
CompoundFileReader.

Doug


Re: Sort Performance Problems across large dataset

2005-01-27 Thread Doug Cutting
Peter Hollas wrote:
Currently we can issue a simple search query and expect a response back 
in about 0.2 seconds (~3,000 results) with the Lucene index that we have 
built. Lucene gives a much more predictable and faster average query 
time than using standard fulltext indexing with mySQL. This however 
returns results in score order, and not alphabetically.

To sort the resultset into alphabetical order, we added the species 
names as a separate keyword field, and sorted using it whilst querying. 
This solution works fine, but is unacceptable since a query that returns 
thousands of results can take upwards of 30 seconds to sort them.
Are you using a Lucene Sort?  If you reuse the same IndexReader (or 
IndexSearcher) then perhaps the first query specifying a Sort will take 
30 seconds (although that's much slower than I'd expect), but subsequent 
searches that sort on the same field should be nearly as fast as results 
sorted by score.

Doug


Re: ParallellMultiSearcher Vs. One big Index

2005-01-18 Thread Doug Cutting
Ryan Aslett wrote:
What I found was that for queries with one term (First Name), the large
index beat the multiple indexes hands down (280 Queries/per second vs
170 Q/s).
But for queries with multiple terms (Address), the multiple indexes beat
out the Large index. (26 Q/s vs 16 Q/s)
Btw, I'm running these on a 2 proc box with 16GB of ram.
So what I'm trying to determine is whether there are some equations out
there that can help me find the sweet spot for splitting my indexes.
What appears to be the bottleneck, CPU or i/o?  Is your test system 
multi-threaded?  I.e., is it attempting to execute many queries in 
parallel?  If you're CPU-bound then a single index should be fastest. 
Are you using compound format?  If you're i/o-bound, the non-compound 
format may be somewhat faster, as it permits more parallel i/o.  Is the 
index data on multiple drives?  If you're i/o bound then it should be 
faster to use multiple drives.  To permit even more parallel i/o over 
multiple drives you might consider using a pool of IndexReaders.  That 
way, with, e.g., striped data, each could be simultaneously reading 
different portions of the same file.
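
One simple way to hand out readers from such a pool is round-robin, 
sketched below in plain Java.  RoundRobinPool is a hypothetical class, and 
the type parameter stands in for IndexReader (or IndexSearcher); how many 
readers to open is something to measure, not something this sketch 
decides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical round-robin pool; T would be an already-opened reader or searcher.
final class RoundRobinPool<T> {
    private final List<T> members;
    private final AtomicInteger next = new AtomicInteger();

    RoundRobinPool(List<T> members) {
        if (members.isEmpty()) {
            throw new IllegalArgumentException("pool must not be empty");
        }
        this.members = new ArrayList<>(members);
    }

    /** Returns the next member in rotation; safe to call from many threads. */
    T acquire() {
        // floorMod keeps the index non-negative even after the counter overflows.
        int index = Math.floorMod(next.getAndIncrement(), members.size());
        return members.get(index);
    }
}
```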

Doug


Re: How to add a Lucene index to a jar file?

2005-01-17 Thread Doug Cutting
David Spencer wrote:
Isn't ZipDirectory the thing to search for?
I think it's actually URLDirectory:
http://www.mail-archive.com/lucene-dev@jakarta.apache.org/msg02453.html
Doug


Re: multi-threaded thru-put in lucene

2005-01-06 Thread Doug Cutting
John Wang wrote:
1 thread: 445 ms.
2 threads: 870 ms.
5 threads: 2200 ms.
Pretty much the same numbers you'd get if you are running them sequentially.
Any ideas? Am I doing something wrong?
If you're performing compute-bound work on a single-processor machine 
then threading should give you no better performance than sequential, 
perhaps a bit worse.  If you're performing io-bound work on a 
single-disk machine then threading should again provide no improvement. 
 If the task is evenly compute and i/o bound then you could achieve at 
best a 2x speedup on a single CPU system with a single disk.
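
The 2x bound follows from a simple steady-state model, sketched here under 
the assumption that one CPU and one disk can overlap perfectly across many 
tasks: sequential cost per task is cpu + io, while the best overlapped 
cost is max(cpu, io).  OverlapSpeedup is a hypothetical name.

```java
// Hypothetical model of one CPU plus one disk with perfect overlap between tasks.
final class OverlapSpeedup {

    /** Best-case speedup from threading when each task needs cpuMillis and ioMillis. */
    static double bestSpeedup(double cpuMillis, double ioMillis) {
        double sequential = cpuMillis + ioMillis;
        double overlapped = Math.max(cpuMillis, ioMillis);  // the busier resource is the limit
        if (overlapped == 0.0) {
            return 1.0;                                     // no work, nothing to speed up
        }
        return sequential / overlapped;
    }
}
```

bestSpeedup(100, 100) is 2.0, the evenly balanced case; bestSpeedup(100, 0) 
is 1.0, the purely compute-bound case.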

If you're compute-bound on an N-CPU system then threading should 
optimally be able to provide a factor of N speedup.

Java's scheduling of compute-bound threads when no threads call 
Thread.sleep() can also be very unfair.

Doug


Re: multi-threaded thru-put in lucene

2005-01-06 Thread Doug Cutting
John Wang wrote:
Is the operation IndexSearcher.search I/O or CPU bound if I am doing
100's of searches on the same query?
CPU bound.
Doug


Re: 1.4.3 breaks 1.4.1 QueryParser functionality

2005-01-05 Thread Doug Cutting
Bill Janssen wrote:
Sure, if I wanted to ship different code for each micro-release of
Lucene (which, you might guess, I don't).  That signature doesn't
compile with 1.4.1.
Bill, most folks bundle appropriate versions of required jars with their 
applications to avoid this sort of problem.  How are you deploying 
things?  Are you not bundling a compatible version of the lucene jar 
with each release of your application?  If not, why not?

I'm not trying to be difficult, just trying to understand.
Thanks,
Doug


Re: CFS file and file formats

2004-12-23 Thread Doug Cutting
Steve Rajavuori wrote:
1) First of all, there are both CFS files and standard (non-compound) files
in this directory, and all of them have recent update dates, so I assume
they are all being used. My code never explicitly sets the compound file
flag, so I don't know how this happened.
This can happen if your application crashes while the index was being 
updated.  In this case these were never entered into the segments file 
and may be partially written.

2) Is there a way to force all files into compound mode? For example, if I
set the compound setting, then call optimize, will that recreate everything
into the CFS format?
It should.  Except, on Windows not all old CFS files will be deleted 
immediately, but may instead be listed in the 'deleteable' file for a while.

3) There are several other large .CFS files in this directory that I think
have somehow become detached from the index. They have recent update dates
-- however, the last time I ran optimize these were not touched, and they
are not being updated now. I know these segments have valid data, because
now when I search I am missing large chunks of data -- which I assume is in
these detached segments. So my thought is to edit the 'segments' file to
make Lucene recognize these again -- but I need to know the correct segment
size in order to do this. So how do I determine what the correct segment
size should be?
These could also be the result of crashes.  In this case they may be 
partially written.

The safest approach is to remove files not mentioned in the segments 
file and update the index with the missing documents.  How does your 
application recover if it crashes during an update?

Doug


Re: CFS file and file formats

2004-12-23 Thread Doug Cutting
Steve Rajavuori wrote:
There are around 20 million documents in the orphaned segments, so it would
take a very long time to update the index. Is there an unsafe way to edit
the segments file to add these back? It seems like the missing piece of
information I need to do this is the correct segment size -- where can I
find that?
Do the CFS and non-CFS segment names correspond?  If so, then it 
probably crashed after the segment was complete, but perhaps before it 
was packed into a CFS file.  So I'd trust the non-CFS stuff first.  And 
it's easy to see the size of a non-CFS segment: it's just the number of 
bytes in each of the .f* files.

Doug


Re: Word co-occurrences counts

2004-12-23 Thread Doug Cutting
Andrew Cunningham wrote:
"computer dog"~50 looks like what I'm after - now is there some way I can 
call this and pull out the number of total occurrences, not just the 
number of document hits? (say if computer and dog occur near each other 
several times in the same document).
You could use a custom Similarity implementation for this query, where 
tf() is the identity function, idf() returns 1.0, etc., so that the 
final score is the occurrence count.  You'll need to divide by 
Similarity.decodeNorm(indexReader.norms(field)[doc]) at the end to get 
rid of the lengthNorm() and field boost (if any).

Doug


Re: Word co-occurrences counts

2004-12-23 Thread Doug Cutting
Doug Cutting wrote:
You could use a custom Similarity implementation for this query, where 
tf() is the identity function, idf() returns 1.0, etc., so that the 
final score is the occurrence count.  You'll need to divide by 
Similarity.decodeNorm(indexReader.norms(field)[doc]) at the end to get 
rid of the lengthNorm() and field boost (if any).
Much simpler would be to build a SpanNearQuery, call getSpans(), then 
loop, counting how many times Spans.next() returns true.

Doug


Re: To Sort or not to Sort

2004-12-16 Thread Doug Cutting
Scott Smith wrote:
1.	Simply use the built-in lucene sort functionality, cache the hit
list and then page through the list.  Adv: looks pretty straight
forward, I write less code.  Dis: for searches that return a large
number of hits (having a search return several hundred to a few thousand
hits is not uncommon), Lucene is sorting a lot of entries that don't
really need to be sorted (because the user will never look at them) and
sorting tends to be expensive.
2.	The other solution uses a priority heap to collect the top N (or
next N) entries.  I still have to walk the entire hit list, but keeping
entries in a priority heap means I can determine the N entries I need
with a few comparisons and minimal sorting.  I don't have to sort a
bunch of entries whose order I don't care about.  Additionally, I don't
have to have all of the entries in memory at one time.  The big
disadvantage with this is that I have to write more code.  However, it
may be worth it if the performance difference is large enough. 
Lucene's built-in sorting code already performs the optimization you 
describe as (2).  So don't bother re-inventing it!

Doug


Re: A question about scoring function in Lucene

2004-12-15 Thread Doug Cutting
Chuck Williams wrote:
I believe the biggest problem with Lucene's approach relative to the pure vector space model is that Lucene does not properly normalize.  The pure vector space model implements a cosine in the strictly positive sector of the coordinate space.  This is guaranteed intrinsically to be between 0 and 1, and produces scores that can be compared across distinct queries (i.e., 0.8 means something about the result quality independent of the query).
I question whether such scores are more meaningful.  Yes, such scores 
would be guaranteed to be between zero and one, but would 0.8 really be 
meaningful?  I don't think so.  Do you have pointers to research which 
demonstrates this?  E.g., when such a scoring method is used, that 
thresholding by score is useful across queries?

Doug


Re: A question about scoring function in Lucene

2004-12-15 Thread Doug Cutting
Otis Gospodnetic wrote:
There is one case that I can think of where this 'constant' scoring
would be useful, and I think Chuck already mentioned this 1-2 months
ago.  For instance, having such scores would allow one to create alert
applications where queries run by some scheduler would trigger an alert
whenever the score is > X.  So that is where the absolute value of the
score would be useful.
Right, but the question is, would a single score threshold be effective 
for all queries, or would one need a separate score threshold for each 
query?  My hunch is that the latter is better, regardless of the scoring 
algorithm.

Also, just because Lucene's default scoring does not guarantee scores 
between zero and one does not necessarily mean that these scores are 
less meaningful.

Doug


Re: A question about scoring function in Lucene

2004-12-15 Thread Doug Cutting
Chris Hostetter wrote:
For example, using the current scoring equation, if i do a search for
Doug Cutting and the results/scores i get back are...
  1:   0.9
  2:   0.3
  3:   0.21
  4:   0.21
  5:   0.1
...then there are at least two meaningful pieces of data I can glean:
   a) document #1 is significantly better than the other results
   b) document #3 and #4 are both equally relevant to Doug Cutting
If I then do a search for Chris Hostetter and get back the following
results/scores...
  9:   0.9
  8:   0.3
  7:   0.21
  6:   0.21
  5:   0.1
...then I can assume the same corresponding information is true about my
new search term (#9 is significantly better, and #6/#7 are equally good)
However, I *cannot* say either of the following:
  x) document #9 is as relevant for Chris Hostetter as document #1 is
 relevant to Doug Cutting
  y) document #5 is equally relevant to both Chris Hostetter and
 Doug Cutting
That's right.  Thanks for the nice description of the issue.
I think the OP is arguing that if the scoring algorithm was modified in
the way they suggested, then you would be able to make statements x & y.
And I am not convinced that, with the changes Chuck describes, one can 
be any more confident of x and y.

Doug


Re: java.io.FileNotFoundException: ... (No such file or directory)

2004-12-08 Thread Doug Cutting
Justin Swanhart wrote:
The indexes are located on a NFS mountpoint. Could this be the
problem?
Yes.  Lucene's lock mechanism is designed to keep this from happening, 
but the sort of lock files that FSDirectory uses are known to be broken 
with NFS.

Doug


Re: reoot site query results

2004-12-06 Thread Doug Cutting
In web search, link information helps greatly.  (This was Google's big 
discovery.)  There are lots more links that point to 
http://www.slashdot.org/ than to http://www.slashdot.org/xxx/yyy, and 
many (if not most) of these links have the term slashdot, while links 
to http://www.slashdot.org/xxx/yyy are somewhat less likely to contain 
the term slashdot.

As Erik hinted, Nutch uses this information.  It keeps a database of 
links that point to each page, indexes their anchor text along with the 
page, and boosts highly linked pages more than lesser linked pages.

Doug
Chris Fraschetti wrote:
My lucene implementation works great, its basically an index of many
web crawls. The main thing my users complain about is say a search for
slashdot will return the
http://www.slashdot.org/soem_dir/somepage.asp as the top result
because the factors i have scoring it determine it as so... but
obviously in true search engine fashion.. i would like
http://www.slashdot.org/ to be the very top result... i've added a
boost to queries that match the hostname field, which helped a little,
but obviously not a proper solution. Does anyone out there in the
search engine world have a good schema for determining root websites
and applying a huge boost to them in one fashion or another? mainly so
it appears before any sub pages? (assuming the query is in reference
to that site) ...


Re: Recommended values for mergeFactor, minMergeDocs, maxMergeDocs

2004-12-06 Thread Doug Cutting
Chuck Williams wrote:
I've got about 30k documents and have 3 indexing scenarios:
1.   Full indexing and optimize
2.   Incremental indexing and optimize
3.   Parallel incremental indexing without optimize
Search performance is critical.  For both cases 1 and 2, I'd like the
fastest possible indexing time.  For case 3, I'd like minimal pauses and
no noticeable degradation in search performance.
 

Based on reading the code (including the javadocs comments), I'm
thinking of values along these lines:
mergeFactor:  1000 during Full indexing, and during optimize (for both
cases 1 and 2); 10 during incremental indexing (cases 2 and 3)
1000 is too big of a mergeFactor for any practical purpose.
I don't see a point in using different mergeFactors in cases 1 and 2. 
If you're going to optimize before you search, then you want the fastest 
batch indexing mode.  I would use something like 50 for both cases 1 and 2.

For case 3, where unoptimized search performance is very important, I 
would use something smaller than 10.  For Technorati's blog search, 
which incrementally maintains a Lucene index with millions of documents, 
I used a mergeFactor of 2 in order to maximize search performance. 
Indexing performance on a single CPU is still adequate to keep up with 
the rate of change of today's blogosphere.

minMergeDocs:  1000 during Full indexing, 10 during incremental indexing
I see no reason to lower this when indexing incrementally.  1000 is a 
good value for high performance indexing when RAM is plentiful and 
documents are not too large.

maxMergeDocs:  Integer.MAX_VALUE during full indexing, 1000 during
incremental indexing
1000 seems low to me, as it will result in too many segments, slowing 
search.  Here one should select the largest value that can be merged in 
the maximum time delay permitted in your application between a new 
document arriving and it appearing in search results.  So how up-to-date 
must your index be?  If it's okay for it to occasionally be a few 
minutes out of date, then you can probably safely increase this to at 
least tens or hundreds of thousands, perhaps even millions.  When 
incrementally indexing, the most recently added segments stay cached in 
RAM by the filesystem.  So, on a system with a gigabyte of RAM that's 
dedicated to incremental indexing, you might safely set maxMergeDocs to 
account for a few hundred megabytes of index without encountering slow, 
i/o-bound merges.
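
To make the trade-off concrete, here is a rough sketch of how mergeFactor affects the number of segments a search has to visit. It is a simplified model and not Lucene's actual merge code: it assumes a segment is flushed every minMergeDocs documents, that every mergeFactor equal-sized segments are merged into one, and it ignores maxMergeDocs. The class and method names are made up for illustration.

```java
// Simplified model of logarithmic segment merging; NOT Lucene's actual code.
class SegmentCountModel {

    // Segments on disk after adding `docs` documents, assuming one segment is
    // flushed per `minMergeDocs` documents and every `mergeFactor` segments of
    // equal size are merged into one.  Under that assumption the count is the
    // digit sum of the number of flushed segments written in base mergeFactor.
    static int segmentsAfter(int docs, int minMergeDocs, int mergeFactor) {
        if (minMergeDocs < 1 || mergeFactor < 2) {
            throw new IllegalArgumentException("need minMergeDocs >= 1 and mergeFactor >= 2");
        }
        int flushed = docs / minMergeDocs;
        int segments = 0;
        while (flushed > 0) {
            segments += flushed % mergeFactor;
            flushed /= mergeFactor;
        }
        return segments;
    }

    // Largest segment count seen at any point while adding up to `docs` documents.
    static int worstCase(int docs, int minMergeDocs, int mergeFactor) {
        int worst = 0;
        for (int d = 0; d <= docs; d += minMergeDocs) {
            worst = Math.max(worst, segmentsAfter(d, minMergeDocs, mergeFactor));
        }
        return worst;
    }

    public static void main(String[] args) {
        for (int mergeFactor : new int[] {2, 10, 50}) {
            System.out.println("mergeFactor=" + mergeFactor
                + " worst-case segments=" + worstCase(30000, 10, mergeFactor));
        }
    }
}
```

For 30k documents and minMergeDocs of 10, this model peaks at 11 segments with a mergeFactor of 2, 29 with 10, and 98 with 50, which is the sense in which a small mergeFactor favours unoptimized search and a large one favours batch indexing.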

Since mergeFactor is used in both addDocument() and optimize(), I'm
thinking of using two different values in case 2:  10 during the
incremental indexing, and then 1000 during the optimize.  Is changing
the value like this going to cause a problem?
It should not cause problems to use different mergeFactors at different 
times.

Doug


Re: Too many open files issue

2004-11-26 Thread Doug Cutting
John Wang wrote:
In the Lucene code, I don't see where the reader specified when
creating a field is closed. That holds on to the file.
I am looking at DocumentWriter.invertDocument()
It is closed in a finally clause on line 170, when the TokenStream is 
closed.

Doug


Re: Numeric Range Restrictions: Queries vs Filters

2004-11-23 Thread Doug Cutting
Hoss wrote:
The attachment contains my RangeFilter, a unit test that demonstrates it,
and a Benchmarking unit test that does a side-by-side comparison with
RangeQuery [6].  If developers feel that this class is useful, then by all
means roll it into the code base.  (90% of it is cut/pasted from
DateFilter/RangeQuery anyway)
+1
DateFilter could be deprecated, and replaced with the more generally and 
appropriately named RangeFilter.  Should we also deprecate DateField, in 
preference for DateTools?

Doug


Re: Backup strategies

2004-11-16 Thread Doug Cutting
Christoph Kiehl wrote:
I'm curious about your strategy to backup indexes based on FSDirectory. 
If I do a file based copy I suspect I will get corrupted data because of 
concurrent write access.
My current favorite is to create an empty index and use 
IndexWriter.addIndexes() to copy the current index state. But I'm not 
sure about the performance of this solution.

How do you make your backups?
A safe way to backup is to have your indexing process, when it knows the 
index is stable (e.g., just after calling IndexWriter.close()), make a 
checkpoint copy of the index by running a shell command like 'cp -lpr 
index index.YYYMMDDHHmmSS'.  This is very fast and requires little disk 
space, since it creates only a new directory of hard links.  Then you 
can separately back this up and subsequently remove it.

This is also a useful way to replicate indexes.  On the master indexing 
server periodically perform 'cp -lpr' as above.  Then search slaves can 
use rsync to pull down the latest version of the index.  If a very small 
mergeFactor is used (e.g., 2) then the index will have only a few 
segments, so that searches are fast.  On the slave, periodically find 
the latest index.YYYMMDDHHmmSS, use 'cp -lpr index/ index.YYYMMDDHHmmSS' 
and 'rsync --delete master:index.YYYMMDDHHmmSS index.YYYMMDDHHmmSS' to 
efficiently get a local copy, and finally 'ln -fsn index.YYYMMDDHHmmSS 
index' to publish the new version of the index.

Doug


Re: document ID and performance

2004-11-16 Thread Doug Cutting
Yan Pujante wrote:
I want to run a very fast search that simply returns the matching 
document id. Is there any way to associate the document id returned in 
the hit collector to the internal document ID stored in the index ? 
Anybody has any idea how to do that ? Ideally you would want to be able 
to write something like this:

document.add(Field.ID(documentID));
and then in the HitCollector API:
collect(String documentID, float score) with the documentID being the 
one you stored (but which would be returned very efficiently)
Have a look at:
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/FieldCache.html
In your HitCollector, access an array, from the field cache, that maps 
Lucene ids to your ids.
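
A minimal sketch of that idea, written without any Lucene dependency so it stands alone. ExternalIdCollector is a hypothetical class that only mirrors the shape of collect(int doc, float score); in a real application the array would come from the field cache (the javadoc linked above lists the accessors for your version) and the class would extend HitCollector.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a hit collector; it does not depend on Lucene.
class ExternalIdCollector {
    // Index = internal (Lucene) document number, value = your own id.
    private final String[] externalIds;
    private final List<String> matches = new ArrayList<>();

    ExternalIdCollector(String[] externalIds) {
        this.externalIds = externalIds;
    }

    // Called once per matching document: a plain array lookup,
    // with no stored-field read from disk.
    void collect(int doc, float score) {
        matches.add(externalIds[doc]);
    }

    List<String> matches() {
        return matches;
    }
}
```

The array costs one entry per document in the index, which is the RAM you trade for not touching stored fields during collection.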

Doug


Re: Search speed

2004-11-02 Thread Doug Cutting
Jeff Munson wrote:
Single word searches return pretty fast, but when I try phrases,
searching seems to slow considerably. [ ... ]
However, if I use this query, contents:"all parts including picture tube
guaranteed", it returns hits in 2890 milliseconds.  Other phrases take
longer as well.  
You could use an analyzer that inserts bigrams for common terms.  Nutch 
does this.  So, if you declare that "all" and "including" are common 
terms, then this could be tokenized as the following tokens:

0 - all all.parts
1 - parts parts.including
2 - including including.picture
3 - picture
4 - tube
5 - guaranteed
Where two tokens are shown at a position, the second has a position 
increment of zero.

Then your phrase search could be converted to:
  all.parts parts.including including.picture picture tube guaranteed
which should be much faster, since it has replaced common terms with 
rare terms.

This approach does make the index larger, and hence makes indexing 
somewhat slower.  So you don't want to declare too many words as common, 
but a handful can make a big difference if they're used frequently in 
queries.
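
The tokenization above can be sketched in a few lines. This is not Nutch's analyzer, just a standalone illustration of the rule as I read it from the token listing: a bigram is added at a position whenever the word there, or the word after it, is a declared common term. The class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class CommonTermBigrams {

    // True when the word at position i is also indexed as a bigram with its
    // successor, i.e. when either word of the pair is a declared common term.
    static boolean hasBigram(String[] words, int i, Set<String> common) {
        return i + 1 < words.length
            && (common.contains(words[i]) || common.contains(words[i + 1]));
    }

    // Tokens emitted at each position at index time: the word itself, plus
    // the bigram (position increment zero) when one applies.
    static List<List<String>> indexTokens(String[] words, Set<String> common) {
        List<List<String>> positions = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            List<String> tokens = new ArrayList<>();
            tokens.add(words[i]);
            if (hasBigram(words, i, common)) {
                tokens.add(words[i] + "." + words[i + 1]);
            }
            positions.add(tokens);
        }
        return positions;
    }

    // Terms of the rewritten phrase query: the bigram where one exists,
    // otherwise the plain word.
    static List<String> phraseTerms(String[] words, Set<String> common) {
        List<String> terms = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            terms.add(hasBigram(words, i, common)
                ? words[i] + "." + words[i + 1]
                : words[i]);
        }
        return terms;
    }

    public static void main(String[] args) {
        String[] words = "all parts including picture tube guaranteed".split(" ");
        Set<String> common = new HashSet<>(Arrays.asList("all", "including"));
        System.out.println(indexTokens(words, common));
        System.out.println(String.join(" ", phraseTerms(words, common)));
    }
}
```

The same set of common terms has to be used at index time and at query time, otherwise the rewritten phrase will look for bigrams that were never indexed.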

Doug


Re: sorting and score ordering

2004-10-13 Thread Doug Cutting
Paul Elschot wrote:
Along with that, is there a simple way to assign a new scorer to the
searcher? So I can use the same lucene algorithm for my hits, but
tweak it a little to fit my needs?

There is no one-to-one relationship between a searcher and a scorer.
But you can use a different Similarity implementation with each Searcher.
Doug


Re: Sort regeneration in multithreaded server

2004-10-08 Thread Doug Cutting
Stephen Halsey wrote:
I was wondering if anyone could help with a problem (or should that be
challenge?) I'm having using Sort in Lucene over a large number of records
in multi-threaded server program on a continually updated index.
I am using lucene-1.4-rc3.
A number of bugs with the sorting code have been fixed since that 
release.  Can you please try with 1.4.2 and see if you still have the 
problem?  Thanks.

Doug


Re: locking problems

2004-10-08 Thread Doug Cutting
Aad Nales wrote:
1. can I have one or multiple searchers open when I open a writer?
2. can I have one or multiple readers open when I open a writer?
Yes, with one caveat: if you've called the IndexReader methods delete(), 
undelete() or setNorm() then you may not open an IndexWriter until 
you've closed that IndexReader instance.

In general, only a single object may modify an index at once, but many 
may access it simultaneously in a read-only manner, including while it 
is modified.  Indexes are modified by either an IndexWriter or by the 
IndexReader methods delete(), undelete() and setNorm().

Typically an application which modifies and searches simultaneously 
should keep the following open:

  1. A single IndexReader instance used for all searches, perhaps 
opened via an IndexSearcher.  Periodically, as the index changes, this 
is discarded, and replaced with a new instance.

  2. Either:
 a. An IndexReader to delete documents; or
 b. An IndexWriter to add documents.
So an updating thread might open (2a), delete old documents, close it, 
then open (2b), add new documents, perhaps optimize, then close.  At this 
point, when the index has been updated, (1) can be discarded and replaced 
with a new instance.  Typically the old instance of (1) is not 
explicitly closed, rather the garbage collector closes it when the last 
thread searching it completes.
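
One way to sketch the "discard and replace" step for (1) is a small holder that search threads read from and the updating thread writes to. This is only an illustration: CurrentSearcher is a hypothetical class, and the type parameter T stands in for whatever searcher object the application opens on the index.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical holder; T stands in for a searcher opened on the index.
class CurrentSearcher<T> {
    private final AtomicReference<T> current;

    CurrentSearcher(T initial) {
        current = new AtomicReference<>(initial);
    }

    // Search threads call this once per query and keep using the instance
    // they got, even if a newer one is published in the meantime.
    T get() {
        return current.get();
    }

    // Called by the updating thread once the writer has been closed.
    // Returns the previous instance so the caller can decide how to retire it.
    T replace(T fresh) {
        return current.getAndSet(fresh);
    }
}
```

Searches already in flight finish against the old instance, which matches the description above of leaving the old searcher to be cleaned up after its last user is done.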

Doug


Re: multifield-boolean vs singlefield-enum query performance

2004-10-07 Thread Doug Cutting
Tea Yu wrote:
For the following implementations:
1) storing boolean strings in fields X and Y separately
2) storing the same info in a field XY as 3 enums: X, Y, B, N meaning only X
is True, only Y is True, both are True or both are False
Is there significant performance gain when we substitute X:T OR Y:T by
XY:B, while significant loss in X:T by XY:X OR XY:B?  Or are they
negligible?
As with most performance questions, it's best to try both and measure! 
It depends on the size of your index, the relative frequencies of X and 
Y, etc.

Doug


Re: removing duplicate Documents from Hits

2004-10-01 Thread Doug Cutting
Timm, Andy (ETW) wrote:
Hello, I've searched on previous posts on this topic but couldn't find an
answer.  I want to query my index (which are a number of 'flattened' Oracle
tables) for some criteria, then return Hits such that there are no Documents
that duplicate a particular field.  In the case where table A has a
one-to-many relationship to table B, I get one Document for each (A1-B1,
A1-B2, A1-B3...).  My index needs to have each of these records as 'B' is a
searchable field in the index.  However, after the query is executed, I want
my resulting Hits to be unique on 'A'.  I'm only returning the Oracle object
ID, so once I've seen it once I don't need it again.  It looks like some
sort of custom Filter is in order.
I'd suggest a HitCollector that uses a FieldCache of the A values to 
check for duplicates, and collect only the best document id for each 
value of A.  This would use a bit of RAM, but be very fast.

http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/HitCollector.html
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/FieldCache.html
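
A standalone sketch of such a collector, assuming the A values have already been loaded into an array indexed by document number (in Lucene that array would come from the field cache). BestPerValueCollector is a made-up name and the class does not extend the real HitCollector, so it compiles on its own.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical collector that keeps the best-scoring document per value of A.
class BestPerValueCollector {
    // Index = internal document number, value = that document's A value.
    private final String[] aValues;
    private final Map<String, Integer> bestDoc = new LinkedHashMap<>();
    private final Map<String, Float> bestScore = new LinkedHashMap<>();

    BestPerValueCollector(String[] aValues) {
        this.aValues = aValues;
    }

    // Called once per matching document; remembers the document only if it
    // beats the best score seen so far for the same A value.
    void collect(int doc, float score) {
        String key = aValues[doc];
        Float previous = bestScore.get(key);
        if (previous == null || score > previous) {
            bestScore.put(key, score);
            bestDoc.put(key, doc);
        }
    }

    // A value -> document number of its best hit.
    Map<String, Integer> bestDocs() {
        return bestDoc;
    }
}
```

Memory use is the array of A values plus one map entry per distinct A that matched, which is the "bit of RAM" mentioned above.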
Doug


new release: 1.4.2

2004-10-01 Thread Doug Cutting
There's a new release of Lucene, 1.4.2, which mostly fixes bugs in 
1.4.1.  Details are at http://jakarta.apache.org/lucene/.

Doug


Re: problem with get/setBoost of document fields

2004-09-29 Thread Doug Cutting
Bastian Grimm [Eastbeam GmbH] wrote:
that works... but i have to do this setNorm() for each document, which 
has been indexed up to now, right? there are round about 1 mio. docs in 
the index... i dont think it's a good idea to perform a search and do it 
for every doc (and every field of the doc...).
is there any possibility to do something like: setNorm(alldocs, 
fieldX, 2.0f) - a global boost for a named field for every doc.
setNorm() is quite fast.  Calling it 1M times will not take long.
a last question: lucene creates some .f[1-9]  after setNorm() has 
finished. does this file remain all the time in this folder? i tried to 
optimize and so one but nothing happend.
If you add or remove documents and optimize then these will go away.
Doug


Re: Shouldnt IndexWriter.flushRamSegments() be public? or at least protected?

2004-09-28 Thread Doug Cutting
Christian Rodriguez wrote:
Now the problem I have is that I dont have a way to force a flush of
the IndexWriter without closing it and I need to do that before
commiting a transaction or I would get random errors. Shouldnt that
function be public, in case the user wants to force a flush at some
point that is not when the IndexWriter is closed? If not I am forced
to create a new IndexWriter and close it EVERY TIME I commit a
transaction (which in my application is very often).
Opening and closing IndexWriters should be a lightweight operation. 
Have you tried this and found it to be too slow?  A flush() would have 
to do just about the same work.

Doug


Re: problem with get/setBoost of document fields

2004-09-23 Thread Doug Cutting
You can change field boosts without re-indexing.
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexReader.html#setNorm(int,%20java.lang.String,%20byte)
Doug
Bastian Grimm [Eastbeam GmbH] wrote:
thanks for your reply, eric.
so i am right that its not possible to change the boost without 
reindexing all files? thats not good... or is it ok only to change the 
boosts an optimize the index to take changes effecting the index?

if not, will i be able to boost those fields in the searcher?
thanks, bastian
-
The boost is not thrown away, but rather combined with the length 
normalization factor during indexing.  So while your actual boost value 
is not stored directly in the index, it is taken into consideration for 
scoring appropriately.



Re: demo HTML parser question

2004-09-23 Thread Doug Cutting
[EMAIL PROTECTED] wrote:
We were originally attempting to use the demo html parser (Lucene 1.2), but as
you know, its for a demo.  I think its threaded to optimize on time, to allow
the calling thread to grab the title or top message even though its not done
parsing the entire html document.
That's almost right.  I originally wrote it that way to avoid having to 
ever buffer the entire text of the document.  The document is indexed 
while it is parsed.  But, as observed, this has lots of problems and was 
probably a bad idea.

Could someone provide a patch that removes the multi-threading?  We'd 
simply use a StringBuffer in HTMLParser.jj to collect the text.  Calls 
to pipeOut.write() would be replaced with text.append().  Then have the 
HTMLParser's constructor parse the page before returning, rather than 
spawn a thread, and getReader() would return a StringReader.  The public 
API of HTMLParser need not change at all and lots of complex threading 
code would be thrown away.  Anyone interested in coding this?

Doug


Re: Document contents split among different Fields

2004-09-23 Thread Doug Cutting
Greg Langmead wrote:
Am I right in saying that the design of Token's support for highlighting
really only supports having the entire document stored as one monolithic
contents Field?
No, I don't think so.
Has anyone tackled indexing multiple content Fields
before that could shed some light?
Do you need highlights from all fields?  If so, then you can use:
  TextFragment[] getBestTextFragments(TokenStream, ...);
with a TokenStream for each field, then select the highest scoring 
fragments across all fields.  Would that work for you?

Doug


Re: Running OutOfMemory while optimizing and searching

2004-09-17 Thread Doug Cutting
John Z wrote:
We have indexes of around 1 million docs and around 25 searchable fields.
We noticed that without any searches performed on the indexes, on startup, the memory taken up by the searcher is roughly 7 times the .tii file size. The .tii file is read into memory as per the code. Our .tii files are around 8-10 MB in size and our startup memory foot print is around 60-70 MB.
 
Then when we start doing our searches, the memory goes up, depending on the fields we search on. We are noticing that if we start searching on new fields, the memory kind of goes up. 
 
Doug, 
 
Your calculation below on what is taken up by the searcher, does it take into account the .tii file being read into memory  or am I not making any sense ? 
 
1 byte * Number of searchable fields in your index * Number of docs in 
your index
plus
1k bytes * number of terms in query
plus
1k bytes * number of phrase terms in query
You make perfect sense.  The formula above does not include the .tii. 
My mistake: I forgot that.  By default, every 128th Term in the index is 
read into memory, to permit random access to terms.  These are stored in 
the .tii file, compressed.  So it is not surprising that they require 7x 
the size of the .tii file in memory.
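
Putting the pieces together as arithmetic, here is a small estimator. It is a back-of-the-envelope sketch only: the class is invented for illustration, "1k" is read as 1024 bytes, and the 7x expansion factor is the figure reported in this thread rather than a constant of the file format.

```java
// Rough memory estimate for a searcher, following the formula quoted above.
class SearcherMemoryEstimate {
    static final long KB = 1024;  // assumption: "1k" read as 1024 bytes

    // Norms: one byte per searchable field per document.
    static long normsBytes(long searchableFields, long docs) {
        return searchableFields * docs;
    }

    // Per-query working memory: 1k per query term plus 1k per phrase term.
    static long queryBytes(long queryTerms, long phraseTerms) {
        return KB * queryTerms + KB * phraseTerms;
    }

    // In-memory term index: the .tii file size times an observed expansion
    // factor (about 7 in the report above).
    static long termIndexBytes(long tiiFileBytes, long expansionFactor) {
        return tiiFileBytes * expansionFactor;
    }

    public static void main(String[] args) {
        long mb = 1024 * 1024;
        System.out.println("norms bytes:      " + normsBytes(25, 1_000_000));
        System.out.println("term index bytes: " + termIndexBytes(8 * mb, 7));
        System.out.println("query bytes:      " + queryBytes(3, 2));
    }
}
```

For the index described (about 1 million docs, 25 searchable fields, an 8-10 MB .tii), that is roughly 56-70 MB for the term index and up to about 25 MB for norms, which is in line with the 60-70 MB seen at startup and with memory growing as more fields are searched.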

Doug


Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-14 Thread Doug Cutting
Andrzej Bialecki wrote:
I was wondering about the way you build the n-gram queries. You 
basically don't care about their position in the input term. Originally 
I thought about using PhraseQuery with a slop - however, after checking 
the source of PhraseQuery I realized that this probably wouldn't be that 
fast... You use BooleanQuery and start/end boosts instead, which may 
give similar results in the end but much cheaper.
Sloppy PhraseQueries are slower than BooleanQueries, but not horribly 
slower.  The problem is that they don't handle the case where phrase 
elements are missing altogether, while a BooleanQuery does.  So what you 
really need is maybe a variation of a sloppy PhraseQuery that scores 
matches that do not contain all of the terms...

Doug


Re: frequent terms - Re: combining open office spellchecker with Lucene

2004-09-14 Thread Doug Cutting
David Spencer wrote:
[1] The user enters a query like:
recursize descent parser
[2] The search code parses this and sees that the 1st word is not a term 
in the index, but the next 2 are. So it ignores the last 2 terms 
(descent and parser) and suggests alternatives to 
recursize...thus if any term is in the index, regardless of frequency, 
 it is left as-is.

I guess you're saying that, if the user enters a term that appears in 
the index and thus is sort of spelled correctly ( as it exists in some 
doc), then we use the heuristic that any sufficiently large doc 
collection will have tons of misspellings, so we assume that rare terms 
in the query might be misspelled (i.e. not what the user intended) and 
we suggest alternatives to these words too (in addition to the words in 
the query that are not in the index at all).
Almost.
If the user enters "a recursize purser", then: "a", which is in, say, 
50% of the documents, is probably spelled correctly and "recursize", 
which is in zero documents, is probably misspelled.  But what about 
"purser"?  If we run the spell check algorithm on "purser" and generate 
"parser", should we show it to the user?  If "purser" occurs in 1% of 
documents and "parser" occurs in 5%, then we probably should, since 
"parser" is a more common word than "purser".  But if "parser" only 
occurs in 1% of the documents and "purser" occurs in 5%, then we probably 
shouldn't bother suggesting "parser".

If you wanted to get really fancy then you could check how frequently 
combinations of query terms occur, i.e., does "purser" or "parser" occur 
more frequently near "descent".  But that gets expensive.
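
The frequency test described here fits in two small predicates. This is a sketch under assumptions: document frequencies are passed in as plain counts (in Lucene they could be looked up from the index), and the cutoff for "very common" is an arbitrary tuning knob, not a value taken from this thread.

```java
// Frequency-based filter for "did you mean" suggestions.
class SuggestionFilter {

    // Skip the spell checker entirely when the entered term already occurs in
    // a large share of the documents.  commonCutoff is an assumed tuning knob
    // (for example 0.25 for "a quarter of the collection").
    static boolean shouldAttemptCorrection(int termDocFreq, int numDocs, double commonCutoff) {
        return termDocFreq < commonCutoff * numDocs;
    }

    // Offer a candidate only if it occurs in more documents than the term
    // the user actually typed.
    static boolean shouldSuggest(int enteredDocFreq, int candidateDocFreq) {
        return candidateDocFreq > enteredDocFreq;
    }
}
```

With 1000 documents, "a" at 500 documents is skipped under a 0.25 cutoff, "recursize" at 0 documents accepts any candidate that occurs at all, and "parser" is offered for "purser" only when "parser" is the more frequent of the two.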

Doug


Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents

2004-09-10 Thread Doug Cutting
It sounds like the ThreadLocal in TermInfosReader is not getting 
correctly garbage collected when the TermInfosReader is collected. 
Researching a bit, this was a bug in JVMs prior to 1.4.2, so my guess is 
that you're running in an older JVM.  Is that right?

I've attached a patch which should fix this.  Please tell me if it works 
for you.

Doug
Daniel Taurat wrote:
Okay, that (1.4rc3)worked fine, too!
Got only 257 SegmentTermEnums for 1900 objects.
Now I will go for the final test on the production server with the 
1.4rc3 version  and about 40.000 objects.

Daniel
Daniel Taurat wrote:
Hi all,
here is some update for you:
I switched back to Lucene 1.3-final and now the  number of the  
SegmentTermEnum objects is controlled by gc again:
it goes up to about 1000 and then it is down again to 254 after 
indexing my 1900 test-objects.
Stay tuned, I will try 1.4RC3 now, the last version before FieldCache 
was introduced...

Daniel
Rupinder Singh Mazara wrote:
hi all
 I had a similar problem, i have  database of documents with 24 
fields, and a average content of 7K, with  16M+ records

 i had to split the jobs into slabs of 1M each and merging the 
resulting indexes, submissions to our job queue looked like

 java -Xms100M -Xcompactexplicitgc -cp $CLASSPATH lucene.Indexer 22
 
and i still had outofmemory exception , the solution that i created 
was to after every 200K, documents create a temp directory, and merge 
them together, this was done to do the first production run, updates 
are now being handled incrementally

 

Exception in thread "main" java.lang.OutOfMemoryError
    at org.apache.lucene.store.RAMOutputStream.flushBuffer(RAMOutputStream.java(Compiled Code))
    at org.apache.lucene.store.OutputStream.flush(OutputStream.java(Inlined Compiled Code))
    at org.apache.lucene.store.OutputStream.writeByte(OutputStream.java(Inlined Compiled Code))
    at org.apache.lucene.store.OutputStream.writeBytes(OutputStream.java(Compiled Code))
    at org.apache.lucene.index.CompoundFileWriter.copyFile(CompoundFileWriter.java(Compiled Code))
    at org.apache.lucene.index.CompoundFileWriter.close(CompoundFileWriter.java(Compiled Code))
    at org.apache.lucene.index.SegmentMerger.createCompoundFile(SegmentMerger.java(Compiled Code))
    at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java(Compiled Code))
    at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java(Compiled Code))
    at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:366)
    at lucene.Indexer.doIndex(CDBIndexer.java(Compiled Code))
    at lucene.Indexer.main(CDBIndexer.java:168)

 

-Original Message-
From: Daniel Taurat [mailto:[EMAIL PROTECTED]
Sent: 10 September 2004 14:42
To: Lucene Users List
Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large number
of documents

Hi Pete,
good hint, but we actually do have physical memory of  4Gb on the 
system. But then: we also have experienced that the gc of ibm 
jdk1.3.1 that we use is sometimes
behaving strangely with too large heap space anyway. (Limit seems to 
be 1.2 Gb)
I can say that gc is not collecting these objects since I  forced gc 
runs when indexing every now and then (when parsing pdf-type 
objects, that is): No effect.

regards,
Daniel
Pete Lewis wrote:
Hi all
Reading the thread with interest, there is another way I've come across out
of memory errors when indexing large batches of documents.
If you have your heap space settings too high, then you get swapping (which
impacts performance) plus you never reach the trigger for garbage
collection, hence you don't garbage collect and hence you run out of memory.
Can you check whether or not your garbage collection is being triggered?

Anomalously therefore if this is the case, by reducing the heap space you
can improve performance and get rid of the out of memory errors.

Cheers
Pete Lewis
- Original Message -
From: Daniel Taurat [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Friday, September 10, 2004 1:10 PM
Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large number of
documents

Daniel Aber wrote:
On Thursday 09 September 2004 19:47, Daniel Taurat wrote:
I am facing an out of memory problem using  Lucene 1.4.1.
Could you try with a recent CVS version? There has been a fix about files
not being deleted after 1.4.1. Not sure if that could cause the problems
you're experiencing.

Regards
Daniel

Well, it seems not to be files, it looks more like those SegmentTermEnum
objects accumulating in memory.
#I've seen some discussion on these objects in the developer-newsgroup
that had taken place some time ago.
I am afraid this is some kind of runaway caching I have to deal with.
Maybe not  correctly addressed in this newsgroup, after all...

Anyway: any idea if there is an API command to re-init caches?
Thanks,
Daniel


Re: MultiFieldQueryParser seems broken... Fix attached.

2004-09-10 Thread Doug Cutting
Daniel Naber wrote:
On Thursday 09 September 2004 18:52, Doug Cutting wrote:

I have not been
able to construct a two-word query that returns a page without both
words in either the content, the title, the url or in a single anchor.
Can you?

Like this one?
konvens leitseite 

Leitseite is only in the title of the first match (www.gldv.org), konvens 
is only in the body.
Good job finding that!  I guess I should fix Nutch's BasicQueryFilter.
Thanks,
Doug


Re: frequent terms - Re: combining open office spellchecker with Lucene

2004-09-10 Thread Doug Cutting
David Spencer wrote:
Doug Cutting wrote:
And one should not try correction at all for terms which occur in a 
large proportion of the collection.

I keep thinking over this one and I don't understand it. If a user 
misspells a word and the did you mean spelling correction algorithm 
determines that a frequent term is a good suggestion, why not suggest 
it? The very fact that it's common could mean that it's more likely that 
the user wanted this word (well, the heuristic here is that users 
frequently search for frequent terms, which is probably wrong, but 
anyway..).
I think you misunderstood me.  What I meant to say was that if the term 
the user enters is very common then spell correction may be skipped. 
Very common words which are similar to the term the user entered should 
of course be shown.  But if the user's term is very common one need not 
even attempt to find similarly-spelled words.  Is that any better?

Doug


Re: MultiFieldQueryParser seems broken... Fix attached.

2004-09-09 Thread Doug Cutting
Bill Janssen wrote:
I'd think that if a user specified a query cutting lucene, with an
implicit AND and the default fields title and author, they'd
expect to see a match in which both cutting and lucene appears.  That is,
(title:cutting OR author:cutting) AND (title:lucene OR author:lucene)
Your proposal is certainly an improvement.
It's interesting to note that in Nutch I implemented something 
different.  There, a search for "cutting lucene" expands to something like:

 (+url:cutting^4.0 +url:lucene^4.0 +url:"cutting lucene"~2147483647^4.0)
 (+anchor:cutting^2.0 +anchor:lucene^2.0 +anchor:"cutting lucene"~4^2.0)
 (+content:cutting +content:lucene +content:"cutting lucene"~2147483647)
So a page with "cutting" in the body and "lucene" in anchor text won't 
match: the body, anchor or url must contain all query terms.  A single 
authority (content, url or anchor) must vouch for all attributes.

Note that Nutch also boosts matches where the terms are close together. 
 Using ~2147483647 permits them to be anywhere in the document, but 
boosts more when they're closer and in-order.  (The ~4 in anchor 
matches is to prohibit matches across different anchors.  Each anchor is 
separated by a Token.positionIncrement() of 4.)

But perhaps this is not a feature.  Perhaps Nutch should instead expand 
this to:

 +(url:cutting^4.0 anchor:cutting^2.0 content:cutting)
 +(url:lucene^4.0 anchor:lucene^2.0 content:lucene)
 url:"cutting lucene"~2147483647^4.0
 anchor:"cutting lucene"~4^2.0
 content:"cutting lucene"~2147483647
That would, e.g., permit a match with only "lucene" in an anchor and 
"cutting" in the content, which the earlier formulation would not.
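
For anyone who wants to experiment with this second formulation, here is a sketch that builds the query string for arbitrary terms and fields. It is not the Nutch BasicQueryFilter code: MultiFieldExpansion and FieldSpec are invented names, the output is plain Lucene query syntax (with the double quotes that sloppy phrases need), and a boost of 1.0 is simply omitted.

```java
import java.util.Arrays;
import java.util.List;

class MultiFieldExpansion {

    // One searchable field with its boost and the slop used for its phrase clause.
    static final class FieldSpec {
        final String name;
        final float boost;
        final int slop;

        FieldSpec(String name, float boost, int slop) {
            this.name = name;
            this.boost = boost;
            this.slop = slop;
        }

        // "^4.0" style suffix; omitted for the default boost of 1.0.
        String boostSuffix() {
            return boost == 1.0f ? "" : "^" + boost;
        }
    }

    static String expand(List<String> terms, List<FieldSpec> fields) {
        StringBuilder query = new StringBuilder();
        // Required clause per term: the term may match in any of the fields.
        for (String term : terms) {
            query.append("+(");
            for (int i = 0; i < fields.size(); i++) {
                FieldSpec f = fields.get(i);
                if (i > 0) query.append(' ');
                query.append(f.name).append(':').append(term).append(f.boostSuffix());
            }
            query.append(") ");
        }
        // Optional sloppy-phrase clause per field: rewards terms that are
        // close together within a single field.
        String phrase = String.join(" ", terms);
        for (int i = 0; i < fields.size(); i++) {
            FieldSpec f = fields.get(i);
            if (i > 0) query.append(' ');
            query.append(f.name).append(":\"").append(phrase).append("\"~")
                 .append(f.slop).append(f.boostSuffix());
        }
        return query.toString();
    }

    public static void main(String[] args) {
        List<FieldSpec> fields = Arrays.asList(
            new FieldSpec("url", 4.0f, Integer.MAX_VALUE),
            new FieldSpec("anchor", 2.0f, 4),
            new FieldSpec("content", 1.0f, Integer.MAX_VALUE));
        System.out.println(expand(Arrays.asList("cutting", "lucene"), fields));
    }
}
```

Because every term gets its own required clause across all fields, a page can satisfy "cutting" in one field and "lucene" in another, while the phrase clauses only add to the score.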

Can anyone tell whether Google has this requirement?  I have not been 
able to construct a two-word query that returns a page without both 
words in either the content, the title, the url or in a single anchor. 
Can you?

If you're interested, the Nutch query expansion code in question is:
http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/src/plugin/query-basic/src/java/net/nutch/searcher/basic/BasicQueryFilter.java?view=markup
To play with it you can download Nutch and use the command:
  bin/nutch net.nutch.searcher.Query
http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]msgId=1798116

Yes, the approach there is similar.  I attempted to complete the
solution and provide a working replacement for MultiFieldQueryParser.
But, inspired by that message, couldn't MultiFieldQueryParser just be a 
subclass of QueryParser that overrides getFieldQuery()?

Cheers,
Doug


Re: combining open office spellchecker with Lucene

2004-09-09 Thread Doug Cutting
Aad Nales wrote:
Before I start reinventing wheels I would like to do a short check to
see if anybody else has already tried this. A customer has requested us
to look into the possibility to perform a spell check on queries. So far
the most promising way of doing this seems to be to create an Analyzer
based on the spellchecker of OpenOffice. My question is: has anybody
tried this before? 
Note that a spell checker used with a search engine should use 
collection frequency information.  That's to say, only corrections 
which are more frequent in the collection than what the user entered 
should be displayed.  Frequency information can also be used when 
constructing the checker.  For example, one need never consider 
proposing terms that occur in very few documents.  And one should not 
try correction at all for terms which occur in a large proportion of the 
collection.
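
The frequency rules above can be sketched in a few lines of Java.  This 
is only an illustration, not part of any existing checker: the class 
name, the docFreq map and the commonRatio cutoff are all hypothetical, 
and the candidate list is assumed to come from whatever spell checker 
is in use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch: filter spelling candidates using document-frequency counts. */
class FrequencyFilteredSuggester {

    /**
     * @param entered     the term the user typed
     * @param candidates  corrections proposed by some spell checker (hypothetical input)
     * @param docFreq     term -> number of documents containing it
     * @param numDocs     total number of documents in the collection
     * @param commonRatio hypothetical cutoff: skip correction entirely when the
     *                    entered term occurs in more than this fraction of documents
     */
    static List<String> suggest(String entered, List<String> candidates,
                                Map<String, Integer> docFreq, int numDocs,
                                double commonRatio) {
        List<String> kept = new ArrayList<>();
        int enteredFreq = docFreq.getOrDefault(entered, 0);
        // A term found in a large share of the collection is left alone.
        if (numDocs > 0 && (double) enteredFreq / numDocs > commonRatio) {
            return kept;
        }
        for (String candidate : candidates) {
            // Keep only corrections that are more frequent than the input.
            if (docFreq.getOrDefault(candidate, 0) > enteredFreq) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Integer> df = new HashMap<>();
        df.put("lucene", 900);
        df.put("lucine", 2);
        df.put("lucerne", 1);
        System.out.println(
            suggest("lucine", Arrays.asList("lucene", "lucerne"), df, 10000, 0.05));
    }
}
```

In a real index the counts would presumably come from the index's own 
term statistics rather than a hand-built map.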

Doug


Re: combining open office spellchecker with Lucene

2004-09-09 Thread Doug Cutting
David Spencer wrote:
Good heuristics but are there any more precise, standard guidelines as 
to how to balance or combine what I think are the following possible 
criteria in suggesting a better choice:
Not that I know of.
- ignore(penalize?) terms that are rare
I think this one is easy to threshold: ignore matching terms that are 
rarer than the term entered.

- ignore(penalize?) terms that are common
This, in effect, falls out of the previous criterion.  A term that is 
very common will not have any matching terms that are more common.  As 
an optimization, you could avoid even looking for matching terms when a 
term is very common.

- terms that are closer (string distance) to the term entered are better
This is the meaty one.
- terms that start w/ the same 'n' chars as the users term are better
Perhaps.  Are folks really better at spelling the beginning of words?
Doug


Re: maximum index size

2004-09-08 Thread Doug Cutting
Chris Fraschetti wrote:
I've seen throughout the list mentions of millions of documents.. 8
million, 20 million, etc etc.. but can lucene potentially handle
billions of documents and still efficiently search through them?
Lucene can currently handle up to 2^31 documents in a single index.  To 
a large degree this is limited by Java ints and arrays (which are 
accessed by ints).  There are also a few places where the file format 
limits things to 2^32.

On typical PC hardware, 2-3 word searches of an index with 10M 
documents, each with around 10k of text, require around 1 second, 
including index i/o time.  Performance is more-or-less linear, so that a 
100M document index might require nearly 10 seconds per search.  Thus, 
as indexes grow folks tend to distribute searches in parallel to many 
smaller indexes.  That's what Nutch and Google 
(http://www.computer.org/micro/mi2003/m2022.pdf) do.

Doug


Re: telling one version of the index from another?

2004-09-07 Thread Doug Cutting
Bill Janssen wrote:
Hi.
Hey, Bill.  It's been a long time!
I've got a Lucene application that's been in use for about two years.
Some users are using Lucene 1.2, some 1.3, and some are moving to 1.4.
The indices seem to behave differently under each version.  I'd like
to add code to my application that checks the current user's index
version against the version of Lucene that they are using, and
automatically re-indexes their files if necessary.  However, I can't
figure out how to tell the version, from the index files.
Prior to 1.4, there were no format numbers in the index.  These are 
being added, file-by-file, as we change file formats.  As you've 
discovered, there is currently no public API to obtain the format number 
of an index.  Also, the formats of different files are revved at 
different times, so there may not be a single format number for the 
entire index.  (Perhaps we should remedy this, by, e.g., always revving 
the segments version whenever any file changes format.)

The documentation on the file formats, at
http://jakarta.apache.org/lucene/docs/fileformats.html, directs me to
the segments file.  However, when I look at a version 1.3 segments
file, it seems to bear little relationship to the format described in
fileformats.html. 
Have a look at the version of fileformats.html that shipped with 1.3. 
You can find this by browsing CVS, looking for the 1.3-final tag.  But 
let me do it for you:

http://cvs.apache.org/viewcvs.cgi/*checkout*/jakarta-lucene/docs/fileformats.html?rev=1.15
According to CVS tags, that describes both the 1.3 and 1.2 index file 
formats.

But the part of fileformats.html dealing with the
segments file contains no compatibility notes, so I assume it hasn't
changed since 1.3. 
I wrote the bit about compatibility notes when I first documented file 
formats, and then promptly forgot about it.  So, until someone 
contributes them, there are no compatibility notes.  Sorry.

Even if it had, what's the idea of using -1 as the
format number for 1.4?
The idea is to promptly break 1.3 and 1.2 code which tries to read the 
index.  Those versions of Lucene don't check format numbers (because 
there were none).  Positive values would give unpredictable errors.  A 
negative value causes an immediate failure.

So, anyone know a way to tell the difference between the various
versions of the index files?  Crufty hacks welcome :-).
The first four bytes of the segments file will mostly do the trick. 
If it is zero or positive, then the index is a 1.2 or 1.3 index.  If it 
is -1, then it's a 1.4-final or later index.
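
As a crufty hack, the check could look roughly like the following.  This 
is a sketch only: the class and method names are made up, it assumes the 
leading value is stored high byte first, and it treats any negative 
value as a format number rather than testing for one specific value.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Paths;

/** Sketch: guess the index generation from the start of the segments file. */
class SegmentsFormatProbe {

    /** Interprets four bytes as a big-endian int (assumed byte order). */
    static int leadingInt(byte[] firstFour) {
        return ByteBuffer.wrap(firstFour, 0, 4).getInt();
    }

    /** Zero or positive: no format number.  Negative: a format number. */
    static String classify(int leading) {
        return leading >= 0
                ? "pre-1.4 index (1.2 or 1.3), no format number"
                : "1.4 or later index, format number " + leading;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: SegmentsFormatProbe <path to segments file>");
            return;
        }
        byte[] head = new byte[4];
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            int n = in.readNBytes(head, 0, 4);
            if (n < 4) {
                System.err.println("file too short to contain a leading int");
                return;
            }
        }
        System.out.println(classify(leadingInt(head)));
    }
}
```

Note that this cannot distinguish 1.2 from 1.3, since neither wrote a 
format number.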

There was a change in formats between 1.2 and 1.3, with no format number 
change.  This was in 1.3 RC1 (note #12 in CHANGES.txt).  The semantics 
of each byte in norm files (.f[0-9]) changed.  In 1.2 each byte 
represented 0.0-255.0 on a linear scale.  In 1.3 and later they're 
eight-bit floats (three-bit mantissa, five-bit exponent, no sign bit). 
The net result is that if you use a 1.2 index with 1.3 or later then the 
correct documents will be returned, but scores and rankings will be wacky.
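
To see why the rankings go wacky, here is a small sketch that decodes 
the same norm byte both ways.  The bit layout follows the description 
above (mantissa in the low three bits, exponent in the high five); the 
exponent zero point of 15 and the class and method names are 
assumptions made for the illustration, not something stated in this 
message.

```java
/** Sketch: one norm byte read as a linear value and as an eight-bit float. */
class NormByte {

    /** Linear reading: the byte is simply a value from 0.0 to 255.0. */
    static float linear(byte b) {
        return (float) (b & 0xFF);
    }

    /** Eight-bit float reading: 3-bit mantissa, 5-bit exponent, no sign. */
    static float eightBitFloat(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int mantissa = b & 7;
        int exponent = (b >> 3) & 31;
        // Assumed zero point of 15 for the exponent; shift both parts
        // into place inside a 32-bit IEEE float.
        int bits = ((exponent + (63 - 15)) << 24) | (mantissa << 21);
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        byte[] samples = {(byte) 124, (byte) 255};
        for (byte b : samples) {
            System.out.println((b & 0xFF) + ": linear=" + linear(b)
                    + " eightBitFloat=" + eightBitFloat(b));
        }
    }
}
```

Under these assumptions the byte 124 reads as 124.0 on the linear scale 
but as 1.0 as an eight-bit float, so an index written under one scheme 
and read under the other gets very different norms.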

With the exception of this last bit, 1.4 should be able to correctly 
handle indexes from earlier releases.  Please report if this is not the 
case.

Cheers,
Doug


Re: Possible to remove duplicate documents in sort API?

2004-09-07 Thread Doug Cutting
Kevin A. Burton wrote:
My problem is that I have two machines... one for searching, one for 
indexing.

The searcher has an existing index.
The indexer found an UPDATED document and then adds it to a new index 
and pushes that new index over to the searcher.

The searcher then reloads and when someone performs a search BOTH 
documents could show up (including the stale document).

I can't do a delete() on the searcher because the indexer doesn't have 
the entire index as the searcher.
I can think of a couple ways to fix this.
If the indexer box kept copies of the indexes that it has already sent 
to the searcher, then it can mark updated documents as deleted in these 
old indexes.  Then you can, with the new index, also distribute new .del 
files for the old indexes.

Alternately, you could, on the searcher box, before you open the new 
index, open an IndexReader on all of the existing indexes and mark all 
new documents as deleted in the old indexes.  This shouldn't take more 
than a few seconds.

IndexReader.delete() just sets a bit in a bit vector that is written to 
file by IndexReader.close().  So it's quite fast.

Doug


Re: Why doesn't Document use a HashSet instead of a LinkedList (DocumentFieldList)

2004-09-07 Thread Doug Cutting
Kevin A. Burton wrote:
It looks like Document.java uses its own implementation of a LinkedList..
Why not use a HashMap to enable O(1) lookup... right now field lookup is 
O(N) which is certainly no fun.

Was this benchmarked?  Perhaps theres the assumption that since 
documents often have few fields the object overhead and hashcode 
overhead would have been less this way.
I have never benchmarked this but would be surprised if it makes a 
measurable difference in any real application.  A linked list is used 
because it naturally supports multiple entries with the same key.  A 
home-grown linked list was used because, when Lucene was first written, 
java.util.LinkedList did not exist.

Please feel free to benchmark this against a HashMap of LinkedList of 
Field.  This would be slower to construct, which may offset any 
increased access speed.
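
For anyone who wants to try the benchmark, the alternative structure 
might look something like this.  It is a sketch with a stand-in 
SimpleField class, not Lucene's Document or Field, and all names are 
hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

/** Sketch: field storage as a HashMap of LinkedList of field. */
class FieldIndexSketch {

    /** Minimal stand-in for a document field. */
    static final class SimpleField {
        final String name;
        final String value;

        SimpleField(String name, String value) {
            this.name = name;
            this.value = value;
        }
    }

    // One list per field name, so repeated names keep all their values.
    private final Map<String, LinkedList<SimpleField>> byName = new HashMap<>();

    void add(String name, String value) {
        byName.computeIfAbsent(name, k -> new LinkedList<>())
              .add(new SimpleField(name, value));
    }

    /** First value added under this name, or null if there is none. */
    String get(String name) {
        LinkedList<SimpleField> fields = byName.get(name);
        return fields == null ? null : fields.getFirst().value;
    }

    /** All values added under this name, in insertion order. */
    List<String> getValues(String name) {
        List<String> values = new ArrayList<>();
        LinkedList<SimpleField> fields = byName.get(name);
        if (fields != null) {
            for (SimpleField f : fields) {
                values.add(f.value);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        FieldIndexSketch doc = new FieldIndexSketch();
        doc.add("author", "first");
        doc.add("author", "second");
        doc.add("title", "example");
        System.out.println(doc.get("author") + " " + doc.getValues("author"));
    }
}
```

Every add() now pays for a hash lookup and possibly a new list, which 
is the construction cost mentioned above.  This version also loses the 
overall insertion order across different field names.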

Doug


Re: speeding up queries (MySQL faster)

2004-08-22 Thread Doug Cutting
Yonik Seeley wrote:
Setup info  Stats:
- 4.3M documents, 12 keyword fields per document, 11
 [ ... ]
field1:4 AND field2:188453 AND field3:1
field1:4  done alone selects around 4.2M records
field2:188453 done alone selects around 1.6M records
field3:1  done alone selects around 1K records
The whole query normally selects less than 50 records
Only the first 10 are returned (or whatever range
the client selects).
The field1:4 clause is probably dominating the cost of query 
execution.  Clauses which match large portions of the collection are 
slow to evaluate.  If there are not too many different such clauses then 
you can optimize this by re-using a Filter in place of such clauses, 
typically a QueryFilter.

For example, Nutch automatically translates such clauses into 
QueryFilters.  See:

http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/src/java/net/nutch/searcher/LuceneQueryOptimizer.java?view=markup
Note that this only converts clauses whose boost is zero.  Since filters 
do not affect ranking we can only safely convert clauses which do not 
contribute to the score, i.e., those whose boost is zero.  Scores might 
still be different in the filtered results because of 
Similarity.coord().  But, in Nutch, Similarity.coord() is overridden to 
always return 1.0, so that the replacement of clauses with filters does 
not alter the final scores at all.

Doug


Re: NegativeArraySizeException when creating a new IndexSearcher

2004-08-20 Thread Doug Cutting
Looks to me like you're using an older version of Lucene on your Linux 
box.  The code is back-compatible, it will read old indexes, but Lucene 
1.3 cannot read indexes created by Lucene 1.4, and will fail in the way 
you describe.

Doug
Sven wrote:
Hi!
I have a problem to port a Lucene based knowledgebase from Windows to Linux.
On Windows it works fine whereas I get a NegativeArraySizeException on Linux
when I try to initialise a new IndexSearcher to search the index. Deleting
and rebuilding the index didn't help. I checked permissions, file path and
lock_dir but as far as I can say they seem to be all right. As I couldn't
find another one with the same problem I guess I've overlooked sth, but I've
run out of ideas. I use lucene-1.4-rc2 and tomcat 5.0.18. Can someone help
me please with this or has an idea?
Kind regards,
Sven
java.lang.NegativeArraySizeException
 at org.apache.lucene.index.TermInfosReader.readIndex(TermInfosReader.java:106)
 at org.apache.lucene.index.TermInfosReader.init(TermInfosReader.java:82)
 at org.apache.lucene.index.SegmentReader.init(SegmentReader.java:141)
 at org.apache.lucene.index.SegmentReader.init(SegmentReader.java:120)
 at org.apache.lucene.index.IndexReader$1.doBody(IndexReader.java:118)
 at org.apache.lucene.store.Lock$With.run(Lock.java:148)
 at org.apache.lucene.index.IndexReader.open(IndexReader.java:111)
 at org.apache.lucene.index.IndexReader.open(IndexReader.java:99)
 at org.apache.lucene.search.IndexSearcher.init(IndexSearcher.java:75)
 at com.sykon.knowledgebase.action.ListQueryResultAction.act(ListQueryResultAction.java:134)
 at org.apache.cocoon.components.treeprocessor.sitemap.ActTypeNode.invoke(ActTypeNode.java:159)
 at org.apache.cocoon.components.treeprocessor.sitemap.ActionSetNode.call(ActionSetNode.java:121)
 at org.apache.cocoon.components.treeprocessor.sitemap.ActSetNode.invoke(ActSetNode.java:98)
 at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:84)
 at org.apache.cocoon.components.treeprocessor.sitemap.PreparableMatchNode.invoke(PreparableMatchNode.java:165)
 at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:107)
 at org.apache.cocoon.components.treeprocessor.sitemap.PipelineNode.invoke(PipelineNode.java:162)
 at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:107)
 at org.apache.cocoon.components.treeprocessor.sitemap.PipelinesNode.invoke(PipelinesNode.java:136)
 at org.apache.cocoon.components.treeprocessor.TreeProcessor.process(TreeProcessor.java:371)
 at org.apache.cocoon.components.treeprocessor.TreeProcessor.process(TreeProcessor.java:312)
 at org.apache.cocoon.components.treeprocessor.sitemap.MountNode.invoke(MountNode.java:133)
 at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:84)
 at org.apache.cocoon.components.treeprocessor.sitemap.PreparableMatchNode.invoke(PreparableMatchNode.java:165)
 at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:107)
 at org.apache.cocoon.components.treeprocessor.sitemap.PipelineNode.invoke(PipelineNode.java:162)
 at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:107)
 at org.apache.cocoon.components.treeprocessor.sitemap.PipelinesNode.invoke(PipelinesNode.java:136)
 at org.apache.cocoon.components.treeprocessor.TreeProcessor.process(TreeProcessor.java:371)
 at org.apache.cocoon.components.treeprocessor.TreeProcessor.process(TreeProcessor.java:312)
 at org.apache.cocoon.Cocoon.process(Cocoon.java:656)
 at org.apache.cocoon.servlet.CocoonServlet.service(CocoonServlet.java:1112)
 at javax.servlet.http.HttpServlet.service(HttpServlet.java:856)
 at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:284)
 at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:204)
 at org.apache.catalina.core.ApplicationDispatcher.invoke(ApplicationDispatcher.java:742)
 at org.apache.catalina.core.ApplicationDispatcher.processRequest(ApplicationDispatcher.java:506)
 at org.apache.catalina.core.ApplicationDispatcher.doForward(ApplicationDispatcher.java:443)
 at org.apache.catalina.core.ApplicationDispatcher.forward(ApplicationDispatcher.java:359)
 at org.apache.jasper.runtime.PageContextImpl.doForward(PageContextImpl.java:712)
 at org.apache.jasper.runtime.PageContextImpl.forward(PageContextImpl.java:682)
 at org.apache.jsp.knowlegebase.controller_jsp._jspService(controller_jsp.java:844)
 at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:133)
 at javax.servlet.http.HttpServlet.service(HttpServlet.java:856)
 at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:311)
 at 

Re: Debian build problem with 1.4.1

2004-08-20 Thread Doug Cutting
I can successfully use gcc 3.4.0 with Lucene as follows:
ant jar jar-demo
gcj -O3 build/lucene-1.5-rc1-dev.jar build/lucene-demos-1.5-rc1-dev.jar 
-o indexer --main=org.apache.lucene.demo.IndexHTML

./indexer -create docs
It runs pretty snappy too!  However I don't know if there's much mileage 
in packaging Lucene as a native library.  It's easy enough for folks to 
compile Lucene this way, and applications built this way are pretty 
small.  The big thing to install is libgcj.

Doug
Jeff Breidenbach wrote:
Ok, Lucene 1.4.1 has been uploaded to Debian. Hopefully it will have
enough time to percolate before the sarge release.
Now that that is taken care of, I'm curious about the status of gcj
compilation. Packaging Lucene as a native library might be useful for
projects such as PyLucene, and it is also advantageous for license
reasons i.e. avoiding the non-free JVM dependency. What's the current
gcj compilation recipe? The best I could find on Google (below) seems
a little bit stale.
http://www.mail-archive.com/[EMAIL PROTECTED]/msg04131.html
Cheers,
Jeff



Re: Hit Score [ Between ]

2004-08-04 Thread Doug Cutting
You could instead use a HitCollector to gather only documents with 
scores in that range.
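
Roughly, the collector would look like the sketch below.  To keep it 
self-contained it uses a stand-in Collector interface with the same 
collect(doc, score) shape instead of Lucene's HitCollector, and the 
class names are hypothetical.  Also keep in mind that a collector sees 
raw scores, which are not necessarily on the same scale as the values 
returned by Hits.score().

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: gather only the documents whose score falls inside a range. */
class ScoreRangeSketch {

    /** Stand-in for a collect(doc, score) style callback. */
    interface Collector {
        void collect(int doc, float score);
    }

    static final class RangeCollector implements Collector {
        private final float min;
        private final float max;
        final List<Integer> docs = new ArrayList<>();

        RangeCollector(float min, float max) {
            this.min = min;
            this.max = max;
        }

        @Override
        public void collect(int doc, float score) {
            // Exclusive bounds, as in the loop in the original question.
            if (score > min && score < max) {
                docs.add(doc);
            }
        }
    }

    public static void main(String[] args) {
        RangeCollector collector = new RangeCollector(0.5f, 0.8f);
        // Simulated (doc, score) pairs standing in for a real search.
        collector.collect(1, 0.4f);
        collector.collect(2, 0.6f);
        collector.collect(3, 0.9f);
        collector.collect(4, 0.7f);
        System.out.println(collector.docs);
    }
}
```

With the real API, the collector would be handed to the searcher along 
with the query, and documents would be fetched afterwards by number.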

Doug
Karthik N S wrote:
Hi 

Apologies
If I want to get all the hits for scores between 0.5f and 0.8f, 
I usually use
query = QueryParser.parse(srchkey, Fields, analyzer);
Hits hits = searcher.search(query);
int tothits = hits.length();

for (int i = 0; i < tothits; i++) {
    docs = hits.doc(i);
    Score = hits.score(i);

    if ((Score > 0.5f) && (Score < 0.8f)) {
        System.out.println("FileName : " + docs.get("filename"));
    }
}

Is there any other way to Do this ,
Please Advise me..
Thx.

  WITH WARM REGARDS 
  HAVE A NICE DAY 
  [ N.S.KARTHIK] 
 




Re: Split an existing index into smaller segments without a re-index?

2004-08-04 Thread Doug Cutting
Kevin A. Burton wrote:
Is it possible to take an existing index (say 1G) and break it up into a 
number of smaller indexes (say 10 100M indexes)...

I don't think there's currently an API for this but it's certainly 
possible (I think).
Yes, it is theoretically possible but not yet implemented.
An easy way to implement it would be to subclass FilterIndexReader to 
return a subset of documents, then use IndexWriter.addIndexes() to write 
out each subset as a new index.  Subsets could be ranges of document 
numbers, and one could use TermPositions.skipTo() to accelerate the 
TermPositions subset implementation, but this still wouldn't be quite as 
fast as an index splitter that only reads each TermPositions once.  If 
we added a lower-level index writing API then one could use that to 
implement this...
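
If the subsets are ranges of document numbers, picking the ranges is 
simple arithmetic.  The sketch below only computes the ranges; the 
class and method names are hypothetical, and the FilterIndexReader 
subclass that would actually expose each range is not shown.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: divide document numbers 0..maxDoc-1 into contiguous ranges. */
class DocRangeSplitter {

    /** Returns 'parts' half-open [start, end) ranges covering 0..maxDoc. */
    static List<int[]> ranges(int maxDoc, int parts) {
        if (parts <= 0) {
            throw new IllegalArgumentException("parts must be positive");
        }
        List<int[]> out = new ArrayList<>();
        for (int i = 0; i < parts; i++) {
            // long arithmetic avoids overflow for very large indexes
            int start = (int) ((long) maxDoc * i / parts);
            int end = (int) ((long) maxDoc * (i + 1) / parts);
            out.add(new int[] {start, end});
        }
        return out;
    }

    public static void main(String[] args) {
        for (int[] r : ranges(10, 3)) {
            System.out.println("[" + r[0] + ", " + r[1] + ")");
        }
    }
}
```

Each range would then back one filtered reader, written out as its own 
index.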

Doug



Re: Caching of TermDocs

2004-07-27 Thread Doug Cutting
John Patterson wrote:
I would like to hold a significant amount of the index in memory but use the
disk index as a spill over.  Obviously the best situation is to hold in
memory only the information that is likely to be used again soon.  It seems
that caching TermDocs would allow popular search terms to be searched more
efficiently while the less common terms would need to be read from disk.
The operating system already caches recent disk i/o.  So what you'd save 
primarily would be the overhead of parsing the data.  However the parsed 
form, a sequence of docNo and freq ints, is nearly eight times as large 
as its compressed size in the index.  So your cache would consume a lot 
of memory.

Whether this would provide much overall speedup depends on the distribution 
of common terms in your query traffic.  If you have a few terms that are 
searched very frequently then it might pay off.  In my experience with 
general-purpose search engines this is not usually the case: folks seem 
to use rarer words in queries than they do in ordinary text.  But in 
some search applications perhaps the traffic is more skewed.  Only some 
experiments would tell for sure.

Doug


Re: Logic of score method in hits class

2004-07-26 Thread Doug Cutting
Lucene scores are not percentages.  They really only make sense compared 
to other scores for the same query.  If you like percentages, you can 
divide all scores by the first score and multiply by 100.
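
A minimal sketch of that arithmetic, assuming the scores are already in 
descending order so that the first one is the largest (the class and 
method names are made up):

```java
/** Sketch: express each score as a percentage of the top score. */
class ScorePercent {

    static float[] toPercent(float[] scores) {
        float[] out = new float[scores.length];
        // Nothing sensible to scale by if there are no hits or the
        // top score is not positive; leave everything at zero.
        if (scores.length == 0 || scores[0] <= 0.0f) {
            return out;
        }
        for (int i = 0; i < scores.length; i++) {
            out[i] = scores[i] / scores[0] * 100.0f;
        }
        return out;
    }

    public static void main(String[] args) {
        float[] percents = toPercent(new float[] {0.5f, 0.25f});
        for (float p : percents) {
            System.out.println(p);
        }
    }
}
```

The top hit always comes out at 100 this way, whether or not it is a 
good match, so the numbers are still only comparable within one query.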

Doug
lingaraju wrote:
Dear  All
How the score method works(logic) in Hits class
For 100% match also score is returning only 69% 

Thanks and regards
Raju


Re: Boosting documents

2004-07-26 Thread Doug Cutting
Rob Clews wrote:
I want to do the same, set a boost for a field containing a date that
lowers as the date is further from now, is there any way I could do
this?
You could implement Similarity.idf(Term, Searcher) to, when 
Term.field().equals("date"), return a value that is greater for more 
recent dates.

Doug


Re: over 300 GB to index: feasability and performance issue

2004-07-26 Thread Doug Cutting
Vincent Le Maout wrote:
I have to index a huge, huge amount of data: about 10 million documents
making up about 300 GB. Is there any technical limitation in Lucene that
could prevent me from processing such amount (I mean, of course, apart
from the external limits induce by the hardware: RAM, disks, the system,
whatever) ?
Lucene is in theory able to support up to 2B documents in a single 
index.  Folks have successfully built indexes with several hundred 
million documents.  10 million should not be a problem.

If possible, does anyone have an idea of the amount of resource
needed: RAM, CPU time, size of indexes, access time on such a collection ?
if not, is it possible to extrapolate an estimation from previous 
benchmarks ?
For simple 2-3 term queries, with average sized documents (~10k of text) 
you should get decent performance (1 second / query) on a 10M document 
index.  An index typically requires around 35% of the plain text size.

Doug


Re: Sort: 1.4-rc3 vs. 1.4-final

2004-07-21 Thread Doug Cutting
The key in the WeakHashMap should be the IndexReader, not the Entry.  I 
think this should become a two-level cache, a WeakHashMap of HashMaps, 
the WeakHashMap keyed by IndexReader, the HashMap keyed by Entry.  I 
think the Entry class can also be changed to not include an IndexReader 
field.  Does this make sense?  Would someone like to construct a patch 
and submit it to the developer list?
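
Something along these lines is what I have in mind.  This is a sketch 
rather than a patch: the reader is typed as Object to keep the example 
self-contained, and the Entry fields shown are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.WeakHashMap;

/** Sketch: WeakHashMap keyed by reader, holding a HashMap keyed by Entry. */
class TwoLevelCache {

    /** Inner key.  Note that it no longer refers to the reader. */
    static final class Entry {
        final String field;
        final int type;

        Entry(String field, int type) {
            this.field = field;
            this.type = type;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Entry)) {
                return false;
            }
            Entry other = (Entry) o;
            return type == other.type && field.equals(other.field);
        }

        @Override
        public int hashCode() {
            return Objects.hash(field, type);
        }
    }

    // Outer map: entries vanish once the reader is no longer referenced.
    private final Map<Object, Map<Entry, Object>> cache = new WeakHashMap<>();

    synchronized Object lookup(Object reader, Entry entry) {
        Map<Entry, Object> inner = cache.get(reader);
        return inner == null ? null : inner.get(entry);
    }

    synchronized void store(Object reader, Entry entry, Object value) {
        cache.computeIfAbsent(reader, r -> new HashMap<>()).put(entry, value);
    }

    public static void main(String[] args) {
        TwoLevelCache cache = new TwoLevelCache();
        Object reader = new Object();
        cache.store(reader, new Entry("date", 1), "sorted values");
        System.out.println(cache.lookup(reader, new Entry("date", 1)));
    }
}
```

As long as the application holds the reader, its cached values stay 
reachable; once the reader is dropped, the whole inner map can be 
collected with it.  The cached values must not themselves refer back to 
the reader, or the weak key will never be cleared.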

Doug
Aviran wrote:
I think I found the problem
FieldCacheImpl uses WeakHashMap to store the cached objects, but since there
is no other reference to this cache it is getting released.
Switching to HashMap solves it.
The only problem is that I don't see anywhere where the cached object will
get released if you open a new IndexReader.
Aviran
-Original Message-
From: Greg Gershman [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, July 21, 2004 13:13 PM
To: Lucene Users List
Subject: RE: Sort: 1.4-rc3 vs. 1.4-final

I've done a bit more snooping around; it seems that in
FieldSortedHitQueue.getCachedComparator(line 153), calls to lookup a stored
comparator in the cache always return null.  This occurs even for the
built-in sort types (I tested it on integers and my code for longs).  The
comparators don't even appear to be being stored in the HashMap to begin
with.
Any ideas?
Greg
 

--- Aviran [EMAIL PROTECTED] wrote:
Since I had to implement sorting in lucene 1.2 I had to write my own
sorting using something similar to a lucene's contribution called
SortField. Yesterday I did some tests, trying to use lucene 1.4 Sort
objects and I realized that my old implementation works 40% faster than
Lucene's implementation. My guess is that you are right and there is a
problem with the cache although I couldn't find what that is yet.
Aviran
-Original Message-
From: Greg Gershman [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 21, 2004 9:22 AM
To: [EMAIL PROTECTED]
Subject: Sort: 1.4-rc3 vs. 1.4-final
When rc3 came out, I modified the classes used for Sorting to, in
addition to Integer, Float and String-based sort keys, use Long values.
All I did was add extra statements in 2 classes (SortField and
FieldSortedHitQueue) that made a special case for longs, and created a
LongSortedHitQueue identical to the IntegerSortedHitQueue, only using
longs.
This worked as expected; Long values converted to strings and stored in
Field.Keyword type fields would be sorted according to Long order.  The
initial query would take a while, to build the sorted array, but
subsequent queries would take little to no time at all.
I went back to look at 1.4 final, and noticed the Sort implementation
has changed quite a bit.  I tried the same type of modifications to the
existing source files, but was unable to achieve similar results.
Each subsequent query seems to take a significant amount of time, as if
the Sorted array is being rebuilt each time.  Also, I tried sorting on
an Integer field and got similar results, which leads me to believe
there might be a caching problem somewhere.
Has anyone else seen this in 1.4-final?  Also, I would like it if Long
sorted fields could become a part of the API; it makes sorting by date
a breeze.
Thanks!
Greg Gershman



Re: Weighting database fields

2004-07-21 Thread Doug Cutting
Ernesto De Santis wrote:
If some field has a boost value set at index time, and at search time
the query has another boost value for this field, what happens?
Which value is used for the boost?
The two boosts are both multiplied into the score.
Doug


Re: Post-sorted inverted index?

2004-07-20 Thread Doug Cutting
You can define a subclass of FilterIndexReader that re-sorts documents 
in TermPositions(Term) and document(int), then use 
IndexWriter.addIndexes() to write this in Lucene's standard format.  I 
have done this in Nutch, with the (as yet unused) IndexOptimizer.

http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/src/java/net/nutch/indexer/IndexOptimizer.java?view=markup
Doug
Aphinyanaphongs, Yindalon wrote:
I gather from reading the documentation that the scores for each document hit are computed at query time.  I have an application that, due to the complexity of the function, cannot compute scores at query time.  Would it be possible for me to store the documents in pre-sorted order in the inverted index? (i.e. after the initial index is created, to have a post processing step to sort and reindex the final documents).
 
For example:
Document A - score 0.2
Document B - score 0.4
Document C - score 0.6
 
Thus for the word 'the', the stored order in the index would be C,B,A.
 
Thanks!



Re: Scoring without normalization!

2004-07-15 Thread Doug Cutting
Have you looked at:
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Similarity.html
in particular, at:
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Similarity.html#lengthNorm(java.lang.String,%20int)
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Similarity.html#queryNorm(float)
Doug
Jones G wrote:
Sadly, I am still running into problems
Explain shows the following after the modification.
Rank: 1 ID: 11285358Score: 5.5740864E8
5.5740864E8 = product of:
  8.3611296E8 = sum of:
8.3611296E8 = product of:
  6.6889037E9 = weight(title:iron in 1235940), product of:
0.12621856 = queryWeight(title:iron), product of:
  7.0507255 = idf(docFreq=10816)
  0.017901499 = queryNorm
5.2994613E10 = fieldWeight(title:iron in 1235940), product of:
  1.0 = tf(termFreq(title:iron)=1)
  7.0507255 = idf(docFreq=10816)
  7.5161928E9 = fieldNorm(field=title, doc=1235940)
  0.125 = coord(1/8)
2.7106019E-8 = product of:
  1.08424075E-7 = sum of:
5.7318403E-9 = weight(abstract:an in 1235940), product of:
  0.03711049 = queryWeight(abstract:an), product of:
2.073038 = idf(docFreq=1569960)
0.017901499 = queryNorm
  1.5445337E-7 = fieldWeight(abstract:an in 1235940), product of:
1.0 = tf(termFreq(abstract:an)=1)
2.073038 = idf(docFreq=1569960)
7.4505806E-8 = fieldNorm(field=abstract, doc=1235940)
1.0269223E-7 = weight(abstract:iron in 1235940), product of:
  0.111071706 = queryWeight(abstract:iron), product of:
6.2046037 = idf(docFreq=25209)
0.017901499 = queryNorm
  9.24558E-7 = fieldWeight(abstract:iron in 1235940), product of:
2.0 = tf(termFreq(abstract:iron)=4)
6.2046037 = idf(docFreq=25209)
7.4505806E-8 = fieldNorm(field=abstract, doc=1235940)
  0.25 = coord(2/8)
  0.667 = coord(2/3)
Rank: 2 ID: 8157438 Score: 2.7870432E8
2.7870432E8 = product of:
  8.3611296E8 = product of:
6.6889037E9 = weight(title:iron in 159395), product of:
  0.12621856 = queryWeight(title:iron), product of:
7.0507255 = idf(docFreq=10816)
0.017901499 = queryNorm
  5.2994613E10 = fieldWeight(title:iron in 159395), product of:
1.0 = tf(termFreq(title:iron)=1)
7.0507255 = idf(docFreq=10816)
7.5161928E9 = fieldNorm(field=title, doc=159395)
0.125 = coord(1/8)
  0.3334 = coord(1/3)
Rank: 3 ID: 10543103Score: 2.7870432E8
2.7870432E8 = product of:
  8.3611296E8 = product of:
6.6889037E9 = weight(title:iron in 553967), product of:
  0.12621856 = queryWeight(title:iron), product of:
7.0507255 = idf(docFreq=10816)
0.017901499 = queryNorm
  5.2994613E10 = fieldWeight(title:iron in 553967), product of:
1.0 = tf(termFreq(title:iron)=1)
7.0507255 = idf(docFreq=10816)
7.5161928E9 = fieldNorm(field=title, doc=553967)
0.125 = coord(1/8)
  0.3334 = coord(1/3)
Rank: 4 ID: 8753559 Score: 2.7870432E8
2.7870432E8 = product of:
  8.3611296E8 = product of:
6.6889037E9 = weight(title:iron in 2563152), product of:
  0.12621856 = queryWeight(title:iron), product of:
7.0507255 = idf(docFreq=10816)
0.017901499 = queryNorm
  5.2994613E10 = fieldWeight(title:iron in 2563152), product of:
1.0 = tf(termFreq(title:iron)=1)
7.0507255 = idf(docFreq=10816)
7.5161928E9 = fieldNorm(field=title, doc=2563152)
0.125 = coord(1/8)
  0.3334 = coord(1/3)
I would like to get rid of all normalizations and just have TF and IDF.
What am I missing?
On Thu, 15 Jul 2004 Anson Lau wrote :
If you don't mind hacking the source:
In Hits.java
In method getMoreDocs()

   // Comment out the following
   //float scoreNorm = 1.0f;
   //if (length > 0 && scoreDocs[0].score > 1.0f) {
   //  scoreNorm = 1.0f / scoreDocs[0].score;
   //}
   // And just set scoreNorm to 1.
   float scoreNorm = 1.0f;
I don't know if u can do it without going to the src.
Anson
-Original Message-
From: Jones G [mailto:[EMAIL PROTECTED]
Sent: Thursday, July 15, 2004 6:52 AM
To: [EMAIL PROTECTED]
Subject: Scoring without normalization!
How do I remove document normalization from scoring in Lucene? I just want
to stick to TF IDF.
Thanks.


Re: Token or not Token, PerFieldAnalyzer

2004-07-15 Thread Doug Cutting
Florian Sauvin wrote:
Everywhere in the documentation (and it seems logical) you say to use
the same analyzer for indexing and querying... how is this handled on
not tokenized fields?
Imperfectly.
The QueryParser knows nothing about the index, so it does not know which 
fields were tokenized and which were not.  Moreover, even the index does 
not know this, since you can freely intermix tokenized and untokenized 
values in a single field.

In my case, I have certain fields on which I want the tokenization and
analysis and everything to happen... but on other fields, I just want to
index the content as it is (no alterations at all) and not analyze at
query time... is that possible?
It is very possible.  A good way to handle this is to use 
PerFieldAnalyzerWrapper.
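
The idea behind per-field analysis can be modelled without Lucene on the classpath. The sketch below is not PerFieldAnalyzerWrapper itself: PerFieldDispatchDemo, tokenize() and keyword() are hypothetical names, and "analysis" is reduced to a function from text to terms. It only illustrates the dispatch: a default treatment for most fields and an untouched, single-term treatment for selected ones, used identically at index and query time.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Lucene-free model of per-field analysis: choose a text-processing
// function by field name, falling back to a default for other fields.
final class PerFieldDispatchDemo {
  private final Function<String, String[]> defaultAnalysis;
  private final Map<String, Function<String, String[]>> perField = new HashMap<>();

  PerFieldDispatchDemo(Function<String, String[]> defaultAnalysis) {
    this.defaultAnalysis = defaultAnalysis;
  }

  // Registers a special treatment for one field.
  void add(String field, Function<String, String[]> analysis) {
    perField.put(field, analysis);
  }

  String[] analyze(String field, String text) {
    return perField.getOrDefault(field, defaultAnalysis).apply(text);
  }

  // Tokenizing analysis: lower-case and split on whitespace.
  static String[] tokenize(String text) {
    return text.toLowerCase().trim().split("\\s+");
  }

  // Keyword-style analysis: the whole value is one unmodified term.
  static String[] keyword(String text) {
    return new String[] {text};
  }

  public static void main(String[] args) {
    PerFieldDispatchDemo analyzer =
        new PerFieldDispatchDemo(PerFieldDispatchDemo::tokenize);
    analyzer.add("id", PerFieldDispatchDemo::keyword);
    System.out.println(String.join("|", analyzer.analyze("body", "Fast Access")));
    System.out.println(String.join("|", analyzer.analyze("id", "AB-12 x")));
  }
}
```

With the real class, the same wrapper instance would be handed to both the IndexWriter and the QueryParser, so that the untokenized fields get the same treatment on both sides.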

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: release migration plan

2004-07-15 Thread Doug Cutting
fp235-5 wrote:
I am looking at the code to implement setIndexInterval() in IndexWriter. I'd
like to have your opinion on the best way to do it.
Currently the creation of an instance of TermInfosWriter requires the following
steps:
...
IndexWriter.addDocument(Document)
IndexWriter.addDocument(Document, Analyzer)
DocumentWriter.addDocument(String, Document)
DocumentWriter.writePostings(Posting[],String)
TermInfosWriter.<init>
To give a different value to indexInterval in TermInfosWriter, we need to add a
variable holding this value into IndexWriter and DocumentWriter and modify the
constructors for DocumentWriter and TermInfosWriter. (quite heavy changes)
I think this is the best approach.  I would replace other parameters in 
these constructors which can be derived from an IndexWriter with the 
IndexWriter.  That way, if we add more parameters like this, they can 
also be passed in through the IndexWriter.

All of the parameters to the DocumentWriter constructor are fields of 
IndexWriter.  So one can instead simply pass a single parameter, an 
IndexWriter, then access its directory, analyzer, similarity and 
maxFieldLength in the DocumentWriter constructor.  A public 
getDirectory() method would also need to be added to IndexWriter for 
this to work.

Similarly, two of SegmentMerger's constructor parameters, the directory 
and the boolean useCompoundFile, could be replaced with an IndexWriter.

In SegmentMerge I would replace the directory parameter with IndexWriter.
Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: AW: Understanding TooManyClauses-Exception and Query-RAM-size

2004-07-12 Thread Doug Cutting
[EMAIL PROTECTED] wrote:
What I really would like to see are some best practices or some advice from
users who are working with really large indices: how they handle this
situation, why they don't have to care about it, or maybe why I am
completely missing the point ;-))
Many folks with really large indexes just don't permit things like 
wildcard and range searches.  For example, Google supports no wildcards 
and has only recently added limited numeric range searching.  Yahoo! 
supports neither.

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Lucene Search has poor cpu utilization on a 4-CPU machine

2004-07-12 Thread Doug Cutting
Aviran wrote:
First let me explain what I found out. I'm running Lucene on a 4 CPU server.
While doing some stress tests I've noticed (by doing a full thread dump) that
searching threads are blocked on the method: public FieldInfo fieldInfo(int
fieldNumber). This causes significant CPU idle time.
What version of Lucene are you running?  Also, can you please send the 
stack traces of the blocked threads, or at least a description of them? 
 I'd be interested to see what context this happens in.  In particular, 
which IndexReader and Searcher/Scorer/Weight methods does it happen under?

I noticed that the class org.apache.lucene.index.FieldInfos uses private
class members Vector byNumber and Hashtable byName, both of which are
synchronized objects. By changing the Vector byNumber to ArrayList byNumber
I was able to get 110% improvement in performance (number of searches per
second).
That's impressive!  Good job finding a bottleneck!
My question is: do the fields byNumber and byName have to be synchronized
and what can happen if I'll change them to be ArrayList and HashMap which
are not synchronized ? Can this corrupt the index or the integrity of the
results?
I think that is a safe change.  FieldInfos is only modified by 
DocumentWriter and SegmentMerger, and there is no possibility of other 
threads accessing those instances.  Please submit a patch to the 
developer mailing list.
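
A minimal sketch of why the unsynchronized collection is safe under the condition stated above, namely that the table is filled by one thread and only read afterwards. FieldTable is a hypothetical stand-in, not Lucene's FieldInfos; the read-only wrapper is an extra safeguard added for the illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Model of a lookup table that is filled once and then only read.
// FieldTable is a hypothetical stand-in, not Lucene's FieldInfos.
final class FieldTable {
  private final List<String> byNumber;

  FieldTable(List<String> names) {
    // Copy, then wrap so that later writes are impossible.
    this.byNumber = Collections.unmodifiableList(new ArrayList<String>(names));
  }

  // No lock needed: the list never changes after construction.
  String fieldName(int fieldNumber) {
    return byNumber.get(fieldNumber);
  }

  public static void main(String[] args) throws InterruptedException {
    List<String> names = new ArrayList<String>();
    names.add("title");
    names.add("body");
    final FieldTable table = new FieldTable(names);

    // Several reader threads share the table without any synchronization.
    Thread[] readers = new Thread[4];
    for (int i = 0; i < readers.length; i++) {
      readers[i] = new Thread(new Runnable() {
        public void run() {
          for (int n = 0; n < 100000; n++) {
            table.fieldName(n % 2);
          }
        }
      });
      readers[i].start();
    }
    for (Thread t : readers) {
      t.join();
    }
    System.out.println(table.fieldName(0) + " " + table.fieldName(1));
  }
}
```

If any thread could still add fields while searches are running, this reasoning would no longer hold and some form of locking would be needed again.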

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Lucene Search has poor cpu utilization on a 4-CPU machine

2004-07-12 Thread Doug Cutting
Aviran wrote:
I use Lucene 1.4 final
Here is the thread dump for one blocked thread (If you want a full thread
dump for all threads I can do that too)
Thanks.  I think I get the point.  I recently removed a synchronization 
point higher in the stack, so that now this one shows up!

Whether or not you submit a patch, please file a bug report in Bugzilla 
with your proposed change, so that we don't lose track of this issue.

Thanks,
Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Why is Field.java final?

2004-07-11 Thread Doug Cutting
Kevin A. Burton wrote:
I was going to create a new IDField class which just calls super( name, 
value, false, true, false) but noticed I was prevented because 
Field.java is final?
You don't need to subclass to do this, just a static method somewhere.
Why is this?  I can't see any harm in making it non-final...
Field and Document are not designed to be extensible.  They are 
persisted in such a way that added methods are not available when the 
field is restored.  In other words, when a field is read, it always 
constructs an instance of Field, not a subclass.
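
The static-method suggestion can be sketched as follows. DemoField is a hypothetical stand-in for a final class like Field, and Fields.id() is an invented helper name; the boolean order (stored, indexed, tokenized) follows the super(...) call quoted above.

```java
// DemoField is a stand-in for a final class such as Field; it cannot be
// subclassed, and it does not need to be.
final class DemoField {
  final String name;
  final String value;
  final boolean stored;
  final boolean indexed;
  final boolean tokenized;

  DemoField(String name, String value,
            boolean stored, boolean indexed, boolean tokenized) {
    this.name = name;
    this.value = value;
    this.stored = stored;
    this.indexed = indexed;
    this.tokenized = tokenized;
  }
}

// Helper class holding the static factory method.
final class Fields {
  private Fields() {}

  // The IDField case: not stored, indexed, not tokenized.
  static DemoField id(String name, String value) {
    return new DemoField(name, value, false, true, false);
  }

  public static void main(String[] args) {
    DemoField f = Fields.id("id", "42");
    System.out.println(f.name + " stored=" + f.stored
        + " indexed=" + f.indexed + " tokenized=" + f.tokenized);
  }
}
```

Because the factory returns a plain instance of the final class, nothing is lost when the field is later read back from the index.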

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Field.java - STORED, NOT_STORED, etc...

2004-07-11 Thread Doug Cutting
Kevin A. Burton wrote:
So I added a few constants to my class:
new Field( name, value, NOT_STORED, INDEXED, NOT_TOKENIZED );
which IMO is a lot easier to maintain.
Why not add these constants to Field.java:
   public static final boolean STORED = true;
   public static final boolean NOT_STORED = false;
   public static final boolean INDEXED = true;
   public static final boolean NOT_INDEXED = false;
   public static final boolean TOKENIZED = true;
   public static final boolean NOT_TOKENIZED = false;
Of course you still have to remember the order but this becomes a lot 
easier to maintain.
It would be best to get the compiler to check the order.
If we change this, why not use type-safe enumerations:
http://www.javapractices.com/Topic1.cjp
The calls would look like:
new Field(name, value, Stored.YES, Indexed.NO, Tokenized.YES);
Stored could be implemented as the nested class:
public final class Stored {
  private Stored() {}
  public static final Stored YES = new Stored();
  public static final Stored NO = new Stored();
}
and the compiler would check the order of arguments.
How's that?
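A stand-alone version of the pattern, for anyone who wants to try it outside Lucene. TypedField is a hypothetical stand-in for Field, and the name strings passed to the private constructors are an addition for readable output; neither is part of the proposal above.

```java
// Type-safe enumeration pattern: a final class with a private constructor
// and a fixed set of instances.
final class Stored {
  private final String name;
  private Stored(String name) { this.name = name; }
  @Override public String toString() { return "Stored." + name; }
  static final Stored YES = new Stored("YES");
  static final Stored NO = new Stored("NO");
}

final class Indexed {
  private final String name;
  private Indexed(String name) { this.name = name; }
  @Override public String toString() { return "Indexed." + name; }
  static final Indexed YES = new Indexed("YES");
  static final Indexed NO = new Indexed("NO");
}

final class Tokenized {
  private final String name;
  private Tokenized(String name) { this.name = name; }
  @Override public String toString() { return "Tokenized." + name; }
  static final Tokenized YES = new Tokenized("YES");
  static final Tokenized NO = new Tokenized("NO");
}

// Hypothetical stand-in for Field, taking one typed flag per position.
final class TypedField {
  private final String name;
  private final Stored stored;
  private final Indexed indexed;
  private final Tokenized tokenized;

  TypedField(String name, String value,
             Stored stored, Indexed indexed, Tokenized tokenized) {
    this.name = name;
    this.stored = stored;
    this.indexed = indexed;
    this.tokenized = tokenized;
  }

  String describe() {
    return name + ": " + stored + ", " + indexed + ", " + tokenized;
  }

  public static void main(String[] args) {
    TypedField f = new TypedField("id", "42", Stored.YES, Indexed.NO, Tokenized.YES);
    System.out.println(f.describe());
    // Swapping two flags is rejected by the compiler, unlike with booleans:
    // new TypedField("id", "42", Indexed.NO, Stored.YES, Tokenized.YES);
  }
}
```

On Java 5 and later the built-in enum keyword gives the same compile-time check with less code.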
Doug

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Field.java - STORED, NOT_STORED, etc...

2004-07-11 Thread Doug Cutting
Doug Cutting wrote:
The calls would look like:
new Field(name, value, Stored.YES, Indexed.NO, Tokenized.YES);
Stored could be implemented as the nested class:
public final class Stored {
  private Stored() {}
  public static final Stored YES = new Stored();
  public static final Stored NO = new Stored();
}
Actually, while we're at it, Indexed and Tokenized are confounded.  A 
single entry would be better, something like:

public final class Index {
  private Index() {}
  public static final Index NO = new Index();
  public static final Index TOKENIZED = new Index();
  public static final Index UN_TOKENIZED = new Index();
}
then calls would look like just:
new Field(name, value, Store.YES, Index.TOKENIZED);
BTW, I think Stored would be better named Store too.
BooleanQuery's required and prohibited flags could get the same 
treatment, with the addition of a nested class like:

public final class Occur {
  private Occur() {}
  public static final Occur MUST_NOT = new Occur();
  public static final Occur SHOULD = new Occur();
  public static final Occur MUST = new Occur();
}
and adding a boolean clause would look like:
booleanQuery.add(new TermQuery(...), Occur.MUST);
Then we can deprecate the old methods.
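To see why a single three-valued type is cleaner than two booleans, here is a small sketch that maps the old flag pair onto it. IndexMode and fromFlags() are hypothetical names, and rejecting the fourth combination with an exception is my own choice for the illustration; it assumes, as the message implies, that tokenization only matters for indexed fields.

```java
// Three-valued replacement for the (indexed, tokenized) boolean pair.
// IndexMode and fromFlags() are illustrative names, not Lucene API.
final class IndexMode {
  private final String name;
  private IndexMode(String name) { this.name = name; }
  @Override public String toString() { return "IndexMode." + name; }

  static final IndexMode NO = new IndexMode("NO");
  static final IndexMode TOKENIZED = new IndexMode("TOKENIZED");
  static final IndexMode UN_TOKENIZED = new IndexMode("UN_TOKENIZED");

  // Maps the old boolean pair onto the single value. Two booleans allow
  // four combinations, but only three of them mean anything.
  static IndexMode fromFlags(boolean indexed, boolean tokenized) {
    if (!indexed) {
      if (tokenized) {
        throw new IllegalArgumentException("tokenized but not indexed");
      }
      return NO;
    }
    return tokenized ? TOKENIZED : UN_TOKENIZED;
  }

  public static void main(String[] args) {
    System.out.println(fromFlags(true, true));
    System.out.println(fromFlags(true, false));
    System.out.println(fromFlags(false, false));
  }
}
```

With the single type, the meaningless combination cannot even be written down, so no runtime check is needed in the new API.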
Comments?
Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Way to repair an index broking during 1/2 optimize?

2004-07-09 Thread Doug Cutting
Kevin A. Burton wrote:
With the typical handful of fields, one should never see more than 
hundreds of files.

We only have 13 fields... Though to be honest I'm worried that even if I 
COULD do the optimize that it would run out of file handles.
Optimization doesn't open all files at once.  The most files that are 
ever opened by an IndexWriter is just:

4 + (5 + numIndexedFields) * (mergeFactor-1)
This includes during optimization.
However, when searching, an IndexReader must keep most files open.  In 
particular, the maximum number of files an unoptimized, non-compound 
IndexReader can have open is:

(5 + numIndexedFields) * (mergeFactor-1) * 
(log_base_mergeFactor(numDocs/minMergeDocs))

A compound IndexReader, on the other hand, should open at most, just:
(mergeFactor-1) * (log_base_mergeFactor(numDocs/minMergeDocs))
An optimized, non-compound IndexReader will open just (5 + 
numIndexedFields) files.

And an optimized, compound IndexReader should only keep one file open.
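The formulas above are easy to plug numbers into. The sketch below does so for 13 indexed fields and 14M documents (figures from this thread) with mergeFactor=10 and minMergeDocs=1000; OpenFileEstimate and its method names are hypothetical, and the results are only the upper bounds the formulas give, not measurements.

```java
// Plugs numbers into the open-file formulas from the message above.
// Class and method names are illustrative only.
final class OpenFileEstimate {

  // IndexWriter: 4 + (5 + numIndexedFields) * (mergeFactor - 1)
  static long writerMax(int numIndexedFields, int mergeFactor) {
    return 4L + (5L + numIndexedFields) * (mergeFactor - 1);
  }

  // log_base_mergeFactor(numDocs / minMergeDocs)
  static double levels(long numDocs, int minMergeDocs, int mergeFactor) {
    return Math.log((double) numDocs / minMergeDocs) / Math.log(mergeFactor);
  }

  // Unoptimized, non-compound IndexReader.
  static double readerMax(int numIndexedFields, int mergeFactor,
                          long numDocs, int minMergeDocs) {
    return (5 + numIndexedFields) * (mergeFactor - 1)
        * levels(numDocs, minMergeDocs, mergeFactor);
  }

  // Unoptimized, compound IndexReader.
  static double compoundReaderMax(int mergeFactor, long numDocs, int minMergeDocs) {
    return (mergeFactor - 1) * levels(numDocs, minMergeDocs, mergeFactor);
  }

  public static void main(String[] args) {
    int fields = 13;
    long docs = 14000000L;
    System.out.println("writer, mergeFactor=10:   " + writerMax(fields, 10));
    System.out.println("writer, mergeFactor=5000: " + writerMax(fields, 5000));
    System.out.println("reader, non-compound:     "
        + Math.ceil(readerMax(fields, 10, docs, 1000)));
    System.out.println("reader, compound:         "
        + Math.ceil(compoundReaderMax(10, docs, 1000)));
    System.out.println("optimized, non-compound:  " + (5 + fields));
  }
}
```

For the writer this gives 166 files at mergeFactor=10 against 89986 at mergeFactor=5000, which is consistent with the file-handle trouble reported earlier in the thread.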
Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Lucene shouldn't use java.io.tmpdir

2004-07-09 Thread Doug Cutting
Armbrust, Daniel C. wrote:
The problem I ran into the other day with the new lock location is that Person A had started an index, ran into problems, erased the index and asked me to look at it.  I tried to rebuild the index (in the same place on a Solaris machine) and found out that A) - her locks still existed, B) - I didn't have a clue where it put the locks on the Solaris machine (since no full path was given with the error - has this been fixed?) and C) - I didn't have permission to remove her locks.
I think these problems have been fixed.  When an index is created, all 
old locks are first removed.  And when a lock cannot be obtained, its 
full pathname is printed.  Can you replicate this with 1.4-final?

I think the locks should go back in the index, and we should fall back or give an option to put them elsewhere for the case of the read-only index.
Changing the lock location is risky.  Code which writes an index would 
not be required to alter the lock location, but code which reads it 
would be.  This can easily lead to uncoordinated access.

So it is best if the default lock location works well in most cases.  We 
try to use a temporary directory writable by all users, and attempt to 
handle situations like those you describe above.  Please tell me if you 
continue to have problems with locking.

Thanks,
Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: indexing help

2004-07-08 Thread Doug Cutting
John Wang wrote:
 The solution you proposed is still a derivative of creating a
dummy document stream. Taking the same example, java (5), lucene (6),
VectorTokenStream would create a total of 11 Tokens whereas only 2 are
necessary.
That's easy to fix.  We just need to reuse the token:
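import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public class VectorTokenStream extends TokenStream {
  private final String[] terms;  // terms to emit
  private final int[] freqs;     // freqs[i] = times to emit terms[i]; each assumed >= 1
  private int term = -1;         // index of the current term
  private int freq = 0;          // emissions left for the current term
  private Token token;           // reused for every emission of the current term

  public VectorTokenStream(String[] terms, int[] freqs) {
    this.terms = terms;
    this.freqs = freqs;
  }

  public Token next() {
    if (freq == 0) {             // current term exhausted: advance
      term++;
      if (term >= terms.length)
        return null;             // end of stream
      token = new Token(terms[term], 0, 0);
      freq = freqs[term];
    }
    freq--;
    return token;
  }
}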
Then only two tokens are created, as you desire.
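The counting logic can be checked without Lucene. In this sketch Tok stands in for Token and ReuseDemo for the token stream; both names are hypothetical. It counts how many tokens are returned and how many distinct objects were allocated, assuming every frequency is at least 1.

```java
import java.util.IdentityHashMap;
import java.util.Map;

// Lucene-free model of the token-reuse logic. Tok stands in for Token.
final class ReuseDemo {
  static final class Tok {
    final String text;
    Tok(String text) { this.text = text; }
  }

  private final String[] terms;
  private final int[] freqs;   // each frequency is assumed to be >= 1
  private int term = -1;
  private int freq = 0;
  private Tok token;

  ReuseDemo(String[] terms, int[] freqs) {
    this.terms = terms;
    this.freqs = freqs;
  }

  // Returns the same Tok instance freqs[i] times for term i, then null.
  Tok next() {
    if (freq == 0) {
      term++;
      if (term >= terms.length) {
        return null;
      }
      token = new Tok(terms[term]);
      freq = freqs[term];
    }
    freq--;
    return token;
  }

  // Returns {tokens returned, distinct Tok instances seen}.
  static int[] drain(ReuseDemo stream) {
    Map<Tok, Boolean> distinct = new IdentityHashMap<Tok, Boolean>();
    int returned = 0;
    for (Tok t = stream.next(); t != null; t = stream.next()) {
      returned++;
      distinct.put(t, Boolean.TRUE);
    }
    return new int[] {returned, distinct.size()};
  }

  public static void main(String[] args) {
    int[] counts = drain(new ReuseDemo(new String[] {"java", "lucene"},
                                       new int[] {5, 6}));
    System.out.println(counts[0] + " tokens returned, "
        + counts[1] + " objects created");
  }
}
```

For java (5), lucene (6) this reports 11 tokens returned from 2 allocated objects.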
If you for some reason don't want to create a dummy document stream, 
then you could instead implement an IndexReader that delivers a 
synthetic index for a single document.  Then use 
IndexWriter.addIndexes() to turn this into a real, FSDirectory-based 
index.  However that would be a lot more work and only very marginally 
faster.  So I'd stick with the approach I've outlined above.  (Note: 
this code has not been compiled or run.  It may have bugs.)

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Way to repair an index broking during 1/2 optimize?

2004-07-08 Thread Doug Cutting
Kevin A. Burton wrote:
So is it possible to fix this index now?  Can I just delete the most 
recent segment that was created?  I can find this by ls -alt
Sorry, I forgot to answer your question: this should work fine.  I don't 
think you should even have to delete that segment.

Also, to elaborate on my previous comment, a mergeFactor of 5000 not 
only delays the work until the end, but it also makes the disk workload 
more seek-dominated, which is not optimal.  So I suspect a smaller merge 
factor, together with a larger minMergeDocs, will be much faster 
overall, including the final optimize().  Please tell us how it goes.

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: problem running lucene 1.4 demo on a solaris machine (permission denied)

2004-07-08 Thread Doug Cutting
MATL (Mats Lindberg) wrote:
When I copied the lucene jar file to the solaris machine from the
windows machine I used an FTP program.
FTP probably mangled the file.  You need to use FTP's binary mode.
Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Way to repair an index broking during 1/2 optimize?

2004-07-08 Thread Doug Cutting
Kevin A. Burton wrote:
No... I changed the mergeFactor back to 10 as you suggested.
Then I am confused about why it should take so long.
Did you by chance set the IndexWriter.infoStream to something, so that 
it logs merges?  If so, it would be interesting to see that output, 
especially the last entry.

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Lucene shouldn't use java.io.tmpdir

2004-07-08 Thread Doug Cutting
Kevin A. Burton wrote:
This is why I think it makes more sense to use our own java.io.tmpdir to 
be on the safe side.
I think the bug is that Tomcat changes java.io.tmpdir.  I thought that 
the point of the system property java.io.tmpdir was to have a portable 
name for /tmp on unix, c:\windows\tmp on Windows, etc.  Tomcat breaks 
that.  So must Lucene have its own way of finding the platform-specific 
temporary directory that everyone can write to?  Perhaps, but it seems a 
shame, since Java already has a standard mechanism for this, which 
Tomcat abuses...
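
For reference, this is all it takes to read the property under discussion. TmpDirDemo is a hypothetical name; the sketch only shows the standard lookup and makes no claim about how Lucene itself chooses its lock directory.

```java
import java.io.File;

// Reads the standard java.io.tmpdir system property.
final class TmpDirDemo {

  // Returns whatever the running JVM considers its temporary directory.
  // A launcher or container can override this with -Djava.io.tmpdir=...
  static File tmpDir() {
    return new File(System.getProperty("java.io.tmpdir"));
  }

  public static void main(String[] args) {
    System.out.println(tmpDir().getAbsolutePath());
  }
}
```

Two JVMs started with different values for this property will resolve different directories, which is exactly the coordination problem when one of them writes an index and the other reads it.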

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: indexing help

2004-07-08 Thread Doug Cutting
John Wang wrote:
Just for my education, can you maybe elaborate on using the
"implement an IndexReader that delivers a
synthetic index" approach?
IndexReader is an abstract class.  It has few data fields, and few 
non-static methods that are not implemented in terms of abstract 
methods.  So, in effect, it is an interface.

When Lucene indexes a token stream it creates a single-document index 
that is then merged with other single- and multi-document indexes to 
form an index that is searched.  You could bypass the first step of this 
(indexing a token stream) by instead directly implementing all of 
IndexReader's abstract methods to return the same thing as the 
single-document index that Lucene would create.  This would be 
marginally faster, as no Token objects would be created at all.  But, 
since IndexReader has a lot of abstract methods, it would be a lot of 
work.  I didn't really mean it as a practical suggestion.

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Way to repair an index broking during 1/2 optimize?

2004-07-08 Thread Doug Cutting
Kevin A. Burton wrote:
During an optimize I assume Lucene starts writing to a new segment and 
leaves all others in place until everything is done and THEN deletes them?
That's correct.
The only settings I use are:
targetIndex.mergeFactor=10;
targetIndex.minMergeDocs=1000;
the resulting index has 230k files in it :-/
Something sounds very wrong for there to be that many files.
The maximum number of files should be around:
  (7 + numIndexedFields) * (mergeFactor-1) * 
(log_base_mergeFactor(numDocs/minMergeDocs))

With 14M documents, log_10(14M/1000) is 4, which gives, for you:
  (7 + numIndexedFields) * 36 = 230k
   7*36 + numIndexedFields*36 = 230k
   numIndexedFields = (230k - 7*36) / 36 =~ 6k
So you'd have to have around 6k unique field names to get 230k files. 
Or something else must be wrong.  Are you running on win32, where file 
deletion can be difficult?

With the typical handful of fields, one should never see more than 
hundreds of files.
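
The back-of-the-envelope calculation above can be reproduced in a few lines. FileCountCheck and its method names are hypothetical; the number of merge levels is rounded to 4, as in the message.

```java
// Reproduces the arithmetic: files = (7 + n) * (mergeFactor - 1) * levels.
// Class and method names are illustrative only.
final class FileCountCheck {

  // Solves the formula for n, the number of indexed fields.
  static double impliedIndexedFields(long files, int mergeFactor, int levels) {
    return (double) files / ((mergeFactor - 1) * levels) - 7;
  }

  // Evaluates the formula for a known number of indexed fields.
  static long expectedFiles(int numIndexedFields, int mergeFactor, int levels) {
    return (7L + numIndexedFields) * (mergeFactor - 1) * levels;
  }

  public static void main(String[] args) {
    // 230k files observed, mergeFactor=10, about 4 merge levels.
    System.out.println("implied fields: "
        + Math.round(impliedIndexedFields(230000L, 10, 4)));
    // The 13 fields the index is reported to have.
    System.out.println("expected files: " + expectedFiles(13, 10, 4));
  }
}
```

With 13 indexed fields the formula predicts about 720 files, so 230k files points at something other than the merge settings.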

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Most efficient way to index 14M documents (out of memory/file handles)

2004-07-07 Thread Doug Cutting
A mergeFactor of 5000 is a bad idea.  If you want to index faster, try 
increasing minMergeDocs instead.  If you have lots of memory this can 
probably be 5000 or higher.

Also, why do you optimize before you're done?  That only slows things. 
Perhaps you have to do it because you've set mergeFactor to such an 
extreme value?  I do not recommend a merge factor higher than 100.

Doug
Kevin A. Burton wrote:
I'm trying to burn an index of 14M documents.
I have two problems.
1.  I have to run optimize() every 50k documents or I run out of file 
handles.  this takes TIME and of course is linear to the size of the 
index so it just gets slower by the time I complete.  It starts to crawl 
at about 3M documents.

2.  I eventually will run out of memory in this configuration.
I KNOW this has been covered before but for the life of me I can't find 
it in the archives, the FAQ or the wiki.
I'm using an IndexWriter with a mergeFactor of 5k and then optimizing 
every 50k documents.

Does it make sense to just create a new IndexWriter for every 50k docs 
and then do one big optimize() at the end?

Kevin
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Most efficient way to index 14M documents (out of memory/file handles)

2004-07-07 Thread Doug Cutting
Julien,
Thanks for the excellent explanation.
I think this thread points to a documentation problem.  We should 
improve the javadoc for these parameters to make it easier for folks to

In particular, the javadoc for mergeFactor should mention that very 
large values (>100) are not recommended, since they can run into file 
handle limitations with FSDirectory.  The maximum number of open files 
while merging is around mergeFactor * (5 + number of indexed fields). 
Perhaps mergeFactor should be tagged an "Expert" parameter to discourage 
folks from playing with it, as it is such a common source of problems.

The javadoc should instead encourage using minMergeDocs to increase 
indexing speed by using more memory.  This parameter is unfortunately 
poorly named.  It should really be called something like maxBufferedDocs.

Doug
Julien Nioche wrote:
It is not surprising that you run out of file handles with such a large
mergeFactor.
Before trying more complex strategies involving RAMDirectories and/or
splitting your indexing across several machines, I reckon you should try
simple things like using a low mergeFactor (e.g. 10) combined with a higher
minMergeDocs (e.g. 1000) and optimizing only at the end of the process.
By setting a higher value to minMergeDocs, you'll index and merge with a
RAMDirectory. When the limit is reached (ex 1000) a segment is written in
the FS. MergeFactor controls the number of segments to be merged, so when
you have 10 segments on the FS (which is already 10x1000 docs), the
IndexWriter will merge them all into a single segment. This is equivalent to
an optimize I think. The process continues like that until it's finished.
Combining these parameters should be enough to achieve good performance.
The good point of using minMergeDocs is that you make heavy use of the
RAMDirectory used by your IndexWriter (== fast) without having to be too
careful with the RAM (which would be the case with an explicit RAMDirectory).
At the same time, keeping your mergeFactor low limits the risk of running
out of file handles.
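The behaviour described above can be simulated with a deliberately simplified model: every minMergeDocs buffered documents become one on-disk segment, and whenever mergeFactor segments of the same size exist they are merged into one. MergeModel is a hypothetical name, and the real merge logic in IndexWriter may differ in its details.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of segment merging, not Lucene's actual implementation.
final class MergeModel {

  // Returns the sizes (in documents) of the segments left on disk,
  // largest first. A partial final buffer is ignored. Needs mergeFactor >= 2.
  static List<Long> segmentsAfter(long numDocs, int minMergeDocs, int mergeFactor) {
    // countAtLevel.get(i) = segments holding minMergeDocs * mergeFactor^i docs
    List<Integer> countAtLevel = new ArrayList<Integer>();
    long flushes = numDocs / minMergeDocs;
    for (long f = 0; f < flushes; f++) {
      int level = 0;
      while (true) {
        if (level == countAtLevel.size()) {
          countAtLevel.add(0);
        }
        int count = countAtLevel.get(level) + 1;
        if (count < mergeFactor) {
          countAtLevel.set(level, count);
          break;
        }
        // mergeFactor segments at this level are merged into one segment
        // at the next level up.
        countAtLevel.set(level, 0);
        level++;
      }
    }
    List<Long> sizes = new ArrayList<Long>();
    for (int level = countAtLevel.size() - 1; level >= 0; level--) {
      long size = minMergeDocs;
      for (int i = 0; i < level; i++) {
        size *= mergeFactor;
      }
      for (int c = 0; c < countAtLevel.get(level); c++) {
        sizes.add(size);
      }
    }
    return sizes;
  }

  public static void main(String[] args) {
    System.out.println(segmentsAfter(14000000L, 1000, 10));
  }
}
```

Under this model, 14M documents with mergeFactor=10 and minMergeDocs=1000 end up in five segments (one of 10M documents and four of 1M), so the final optimize() has little left to do.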
- Original Message - 
From: Kevin A. Burton [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Wednesday, July 07, 2004 7:44 AM
Subject: Most efficient way to index 14M documents (out of memory/file
handles)


I'm trying to burn an index of 14M documents.
I have two problems.
1.  I have to run optimize() every 50k documents or I run out of file
handles.  this takes TIME and of course is linear to the size of the
index so it just gets slower by the time I complete.  It starts to crawl
at about 3M documents.
2.  I eventually will run out of memory in this configuration.
I KNOW this has been covered before but for the life of me I can't find
it in the archives, the FAQ or the wiki.
I'm using an IndexWriter with a mergeFactor of 5k and then optimizing
every 50k documents.
Does it make sense to just create a new IndexWriter for every 50k docs
and then do one big optimize() at the end?
Kevin
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: indexing help

2004-07-07 Thread Doug Cutting
John Wang wrote:
 While lucene tokenizes the words in the document, it counts the
frequency and figures out the position, we are trying to bypass this
stage: For each document, I have a set of words with a known frequency,
e.g. java (5), lucene (6) etc. (I don't care about the position, so it
can always be 0.)
 What I can do now is to create a dummy document, e.g. java java
java java java lucene lucene lucene lucene lucene and pass it to
lucene.
 This seems hacky and cumbersome. Is there a better alternative? I
browsed around in the source code, but couldn't find anything.
Write an analyzer that returns terms with the appropriate distribution.
For example:
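import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class VectorTokenStream extends TokenStream {
  private final String[] terms;  // terms to emit
  private final int[] freqs;     // freqs[i] = times to emit terms[i]; each assumed >= 1
  private int term = -1;         // start before the first term
  private int freq = 0;          // emissions left for the current term
  public VectorTokenStream(String[] terms, int[] freqs) {
    this.terms = terms;
    this.freqs = freqs;
  }
  public Token next() {
    if (freq == 0) {             // current term exhausted: advance
      term++;
      if (term >= terms.length)
        return null;             // end of stream
      freq = freqs[term];
    }
    freq--;
    return new Token(terms[term], 0, 0);
  }
}

// The field value is empty; the analyzer supplies the tokens instead.
Document doc = new Document();
doc.add(Field.Text("content", ""));
indexWriter.addDocument(doc, new Analyzer() {
  public TokenStream tokenStream(String field, Reader reader) {
    return new VectorTokenStream(new String[] {"java","lucene"},
                                 new int[] {5,6});
  }
});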
  Too bad the Field class is final, otherwise I can derive from it
and do something on that line...
Extending Field would not help.  That's why it's final.
Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Running OutOfMemory while optimizing and searching

2004-07-01 Thread Doug Cutting
 What do your queries look like?  The memory required
 for a query can be computed by the following equation:

 1 Byte * Number of fields in your query * Number of
 docs in your index

 So if your query searches on all 50 fields of your 3.5
 Million document index then each search would take
 about 175MB.  If your 3-4 searches run concurrently
 then that's about 525MB to 700MB chewed up at once.
That's not quite right.  If you use the same IndexSearcher (or 
IndexReader) for all of the searches, then only 175MB are used.  The 
arrays in question (the norms) are read-only and can be shared by all 
searches.

In general, the amount of memory required is:
1 byte * Number of searchable fields in your index * Number of docs in 
your index

plus
1k bytes * number of terms in query
plus
1k bytes * number of phrase terms in query
The latter are for i/o buffers.  There are a few other things, but these 
are the major ones.
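
As a worked example of the estimate above, using the figures from the question (50 searchable fields, 3.5 million documents). SearchMemoryEstimate and its method names are hypothetical, and "1k" is taken to mean 1024 bytes.

```java
// Plugs numbers into the memory estimate from the message above.
// Class and method names are illustrative only.
final class SearchMemoryEstimate {

  // Norms: 1 byte per searchable field per document, shared by all
  // searches that use the same IndexSearcher or IndexReader.
  static long normsBytes(int searchableFields, long numDocs) {
    return searchableFields * numDocs;
  }

  // I/O buffers: about 1k per query term plus 1k per phrase term.
  // Assumes 1k = 1024 bytes.
  static long queryBufferBytes(int terms, int phraseTerms) {
    return 1024L * terms + 1024L * phraseTerms;
  }

  public static void main(String[] args) {
    long norms = normsBytes(50, 3500000L);
    long perQuery = queryBufferBytes(10, 2);
    System.out.println("norms (shared):   " + norms + " bytes");
    System.out.println("per query (10+2): " + perQuery + " bytes");
    // Four concurrent searches on one searcher share the norms.
    System.out.println("4 searches total: " + (norms + 4 * perQuery) + " bytes");
  }
}
```

The shared norms dominate: four concurrent queries on one searcher cost roughly 175MB in total, not four times that.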

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: problems with lucene in multithreaded environment

2004-06-07 Thread Doug Cutting
Jayant Kumar wrote:
Thanks for the patch. It helped in increasing the
search speed to a good extent.
Good.  I'll commit it.  Thanks for testing it.
But when we tried to
give about 100 queries in 10 seconds, then again we
found that after about 15 seconds, the response time
per query increased.
This still sounds very slow to me.  Is your index optimized?  What JVM 
are you using?

You might also consider ramping up your benchmark more slowly, to warm 
the filesystem's cache.  So, when you first launch the server, give it a 
few queries at a lower rate, then, after those have completed, try a 
higher rate.

We were able to simplify the searches further by
consolidating the fields in the index but that
resulted in increasing the index size to 2.5 GB as we
required fields 2-5 and fields 1-7 in different
searches.
That will slow updates a bit, but searching should be faster.
How about your range searches?  Do you know how many terms they match? 
The easiest way to determine this might be to insert a print statement 
in RangeQuery.rewrite() that shows the query before it is returned.

Our indexes are on the local disk therefore
there is no network i/o involved.
It does look like file i/o is now your bottleneck.  The traces below show 
that you're using the compound file format, which combines many files 
into one.  When two threads try to read two logically different files 
(.prx and .frq below) they must synchronize when the compound format is 
used.  But if your application did not use the compound format this 
synchronization would not be required.  So you should try rebuilding 
your index with the compound format turned off.  (The fastest way to do 
this is simply to add and/or delete a single document, then re-optimize 
the index with compound format turned off.  This will cause the index to 
be re-written in non-compound format.)

Is this on linux?  If so, please try running 'iostat -x 1' while you 
perform your benchmark (iostat is installed by the 'sysstat' package). 
What percentage is the disk utilized (%util)?  What is the percentage of 
idle CPU (%idle)?  What is the rate of data that is read (rkB/s)?  If 
things really are i/o bound then you might consider spreading the data 
over multiple disks, e.g., with lvm striping or a RAID controller.

If you have a lot of RAM, then you could also consider moving certain 
files of the index onto a ramfs-based drive.  For example, moving the 
.tis, .frq and .prx can greatly improve performance.  Also, having these 
files in RAM means that the cache does not need to be warmed.

Hope this helps!
Doug
Thread-23 prio=1 tid=0x08169f38 nid=0x2867 waiting for monitor entry [69bd4000..69bd48c8]
  at org.apache.lucene.index.CompoundFileReader$CSInputStream.readInternal(CompoundFileReader.java:217)
  - waiting to lock 0x46f1b828 (a org.apache.lucene.store.FSInputStream)
  at org.apache.lucene.store.InputStream.refill(InputStream.java:158)
  at org.apache.lucene.store.InputStream.readByte(InputStream.java:43)
  at org.apache.lucene.store.InputStream.readVInt(InputStream.java:83)
  at org.apache.lucene.index.SegmentTermPositions.next(SegmentTermPositions.java:58)

Thread-22 prio=1 tid=0x08159f78 nid=0x2866 waiting for monitor entry [69b53000..69b538c8]
  at org.apache.lucene.index.CompoundFileReader$CSInputStream.readInternal(CompoundFileReader.java:217)
  - waiting to lock 0x46f1b828 (a org.apache.lucene.store.FSInputStream)
  at org.apache.lucene.store.InputStream.refill(InputStream.java:158)
  at org.apache.lucene.store.InputStream.readByte(InputStream.java:43)
  at org.apache.lucene.store.InputStream.readVInt(InputStream.java:86)
  at org.apache.lucene.index.SegmentTermDocs.read(SegmentTermDocs.java:126)
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: problems with lucene in multithreaded environment

2004-06-04 Thread Doug Cutting
Jayant Kumar wrote:
Please find enclosed jvmdump.txt which contains a dump
of our search program after about 20 seconds of
starting the program.
Also enclosed is the file queries.txt which contains
few sample search queries.
Thanks for the data.  This is exactly what I was looking for.
Thread-14 prio=1 tid=0x080a7420 nid=0x468e waiting for monitor entry [4d61a000..4d61ac18]
  at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:112)
  - waiting to lock 0x44c95228 (a org.apache.lucene.index.TermInfosReader)

Thread-12 prio=1 tid=0x080a58e0 nid=0x468e waiting for monitor entry [4d51a000..4d51ad18]
  at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:112)
  - waiting to lock 0x44c95228 (a org.apache.lucene.index.TermInfosReader)
These are all stuck looking terms up in the dictionary (TermInfos). 
Things would be much faster if your queries didn't have so many terms.

Query : (  (  (  (  (  FIELD1: proof OR  FIELD2: proof OR  FIELD3: proof OR  FIELD4: proof OR  FIELD5: proof OR  FIELD6: proof OR  FIELD7: proof ) AND (  FIELD1: george bush OR  FIELD2: george bush OR  FIELD3: george bush OR  FIELD4: george bush OR  FIELD5: george bush OR  FIELD6: george bush OR  FIELD7: george bush )  ) AND (  FIELD1: script OR  FIELD2: script OR  FIELD3: script OR  FIELD4: script OR  FIELD5: script OR  FIELD6: script OR  FIELD7: script )  ) AND (  (  FIELD1: san OR  FIELD2: san OR  FIELD3: san OR  FIELD4: san OR  FIELD5: san OR  FIELD6: san OR  FIELD7: san ) OR (  (  FIELD1: war OR  FIELD2: war OR  FIELD3: war OR  FIELD4: war OR  FIELD5: war OR  FIELD6: war OR  FIELD7: war ) OR (  (  FIELD1: gulf OR  FIELD2: gulf OR  FIELD3: gulf OR  FIELD4: gulf OR  FIELD5: gulf OR  FIELD6: gulf OR  FIELD7: gulf ) OR (  (  FIELD1: laden OR  FIELD2: laden OR  FIELD3: laden OR  FIELD4: laden OR  FIELD5: laden OR  FIELD6: laden OR  FIELD7: laden ) OR (  (
FIELD1: ttouristeat OR  FIELD2: ttouristeat OR  FIELD3: ttouristeat OR  FIELD4: 
ttouristeat OR  FIELD5: ttouristeat OR  FIELD6: ttouristeat OR  FIELD7: ttouristeat ) 
OR (  (  FIELD1: pow OR  FIELD2: pow OR  FIELD3: pow OR  FIELD4: pow OR  FIELD5: pow 
OR  FIELD6: pow OR  FIELD7: pow ) OR (  FIELD1: bin OR  FIELD2: bin OR  FIELD3: bin OR 
 FIELD4: bin OR  FIELD5: bin OR  FIELD6: bin OR  FIELD7: bin )  )  )  )  )  )  )  )  ) 
AND  RANGE: ([ 0800 TO 1100 ]) AND  (  S_IDa: (7 OR 8 OR 9 OR 10 OR 11 OR 12 OR 13 OR 
14 OR 15 OR 16 OR 17 )  or  S_IDb: (2 )  )
All your queries look for terms in fields 1-7.  If you instead combined 
the contents of fields 1-7 in a single field, and searched that field, 
then your searches would contain far fewer terms and be much faster.

Also, I don't know how many terms your RANGE queries match, but that 
could also be introducing large numbers of terms which would slow things 
down too.

But, still, you have identified a bottleneck: TermInfosReader caches a 
TermEnum and hence access to it must be synchronized.  Caching the enum 
greatly speeds sequential access to terms, e.g., when merging, 
performing range or prefix queries, etc.  Perhaps however the cache 
should be done through a ThreadLocal, giving each thread its own cache 
and obviating the need for synchronization...
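
As a minimal sketch of the per-thread cache idea, assuming a modern JVM with generics: the Cursor class below is a hypothetical stand-in for the cached TermEnum, and PerThreadCursorCache is an invented name, not Lucene code.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a stateful, non-thread-safe enumerator.
class Cursor {
    int position = -1; // mutable state that must not be shared between threads
}

class PerThreadCursorCache {
    // Counts how many cursors were created; for demonstration only.
    final AtomicInteger created = new AtomicInteger();

    // Each thread lazily receives its own Cursor on first access.
    private final ThreadLocal<Cursor> cursors = new ThreadLocal<Cursor>() {
        @Override
        protected Cursor initialValue() {
            created.incrementAndGet();
            return new Cursor();
        }
    };

    // Not synchronized: the returned cursor is private to the calling thread.
    Cursor get() {
        return cursors.get();
    }
}
```

The trade-off is one cached object per thread instead of one per reader, in exchange for removing the lock from the lookup path.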

Please tell me if you are able to simplify your queries and if that 
speeds things up.  I'll look into a ThreadLocal-based solution too.

Doug


Re: problems with lucene in multithreaded environment

2004-06-04 Thread Doug Cutting
Doug Cutting wrote:
Please tell me if you are able to simplify your queries and if that 
speeds things up.  I'll look into a ThreadLocal-based solution too.
I've attached a patch that should help with the thread contention, 
although I've not tested it extensively.

I still don't fully understand why your searches are so slow, though. 
Are the indexes stored on the local disk of the machine?  Indexes 
accessed over the network can be very slow.

Anyway, give this patch a try.  Also, if anyone else can try this and 
report back whether it makes multi-threaded searching faster, or 
anything else slower, or is buggy, that would be great.

Thanks,
Doug
Index: src/java/org/apache/lucene/index/TermInfosReader.java
===
RCS file: /home/cvs/jakarta-lucene/src/java/org/apache/lucene/index/TermInfosReader.java,v
retrieving revision 1.6
diff -u -u -r1.6 TermInfosReader.java
--- src/java/org/apache/lucene/index/TermInfosReader.java	20 May 2004 11:23:53 -	1.6
+++ src/java/org/apache/lucene/index/TermInfosReader.java	4 Jun 2004 21:45:15 -
@@ -29,7 +29,8 @@
   private String segment;
   private FieldInfos fieldInfos;
 
-  private SegmentTermEnum enumerator;
+  private ThreadLocal enumerators = new ThreadLocal();
+  private SegmentTermEnum origEnum;
   private long size;
 
   TermInfosReader(Directory dir, String seg, FieldInfos fis)
@@ -38,19 +39,19 @@
 segment = seg;
 fieldInfos = fis;
 
-enumerator = new SegmentTermEnum(directory.openFile(segment + ".tis"),
-			   fieldInfos, false);
-size = enumerator.size;
+origEnum = new SegmentTermEnum(directory.openFile(segment + ".tis"),
+   fieldInfos, false);
+size = origEnum.size;
 readIndex();
   }
 
   public int getSkipInterval() {
-return enumerator.skipInterval;
+return origEnum.skipInterval;
   }
 
   final void close() throws IOException {
-if (enumerator != null)
-  enumerator.close();
+if (origEnum != null)
+  origEnum.close();
   }
 
   /** Returns the number of term/value pairs in the set. */
@@ -58,6 +59,15 @@
 return size;
   }
 
+  private SegmentTermEnum getEnum() {
+SegmentTermEnum enum = (SegmentTermEnum)enumerators.get();
+if (enum == null) {
+  enum = terms();
+  enumerators.set(enum);
+}
+return enum;
+  }
+
   Term[] indexTerms = null;
   TermInfo[] indexInfos;
   long[] indexPointers;
@@ -102,16 +112,17 @@
   }
 
   private final void seekEnum(int indexOffset) throws IOException {
-enumerator.seek(indexPointers[indexOffset],
-	  (indexOffset * enumerator.indexInterval) - 1,
+getEnum().seek(indexPointers[indexOffset],
+	  (indexOffset * getEnum().indexInterval) - 1,
 	  indexTerms[indexOffset], indexInfos[indexOffset]);
   }
 
   /** Returns the TermInfo for a Term in the set, or null. */
-  final synchronized TermInfo get(Term term) throws IOException {
+  TermInfo get(Term term) throws IOException {
 if (size == 0) return null;
 
-// optimize sequential access: first try scanning cached enumerator w/o seeking
+// optimize sequential access: first try scanning cached enum w/o seeking
+SegmentTermEnum enumerator = getEnum();
 if (enumerator.term() != null // term is at or past current
 	&& ((enumerator.prev != null && term.compareTo(enumerator.prev) > 0)
 	|| term.compareTo(enumerator.term()) >= 0)) {
@@ -128,6 +139,7 @@
 
   /** Scans within block for matching term. */
   private final TermInfo scanEnum(Term term) throws IOException {
+SegmentTermEnum enumerator = getEnum();
 while (term.compareTo(enumerator.term()) > 0 && enumerator.next()) {}
 if (enumerator.term() != null && term.compareTo(enumerator.term()) == 0)
   return enumerator.termInfo();
@@ -136,10 +148,12 @@
   }
 
   /** Returns the nth term in the set. */
-  final synchronized Term get(int position) throws IOException {
+  final Term get(int position) throws IOException {
 if (size == 0) return null;
 
-if (enumerator != null && enumerator.term() != null && position >= enumerator.position &&
+SegmentTermEnum enumerator = getEnum();
+if (enumerator != null && enumerator.term() != null &&
+position >= enumerator.position &&
 	position < (enumerator.position + enumerator.indexInterval))
   return scanEnum(position);		  // can avoid seek
 
@@ -148,6 +162,7 @@
   }
 
   private final Term scanEnum(int position) throws IOException {
+SegmentTermEnum enumerator = getEnum();
 while(enumerator.position < position)
   if (!enumerator.next())
 	return null;
@@ -156,12 +171,13 @@
   }
 
   /** Returns the position of a Term in the set or -1. */
-  final synchronized long getPosition(Term term) throws IOException {
+  final long getPosition(Term term) throws IOException {
 if (size == 0) return -1;
 
 int indexOffset = getIndexOffset(term);
 seekEnum(indexOffset);
 
+SegmentTermEnum enumerator = getEnum();
 while

Re: problems with lucene in multithreaded environment

2004-06-02 Thread Doug Cutting
Jayant Kumar wrote:
We recently tested lucene with an index size of 2 GB
which has about 1,500,000 documents, each document
having about 25 fields. The frequency of search was
about 20 queries per second. This resulted in an
average response time of about 20 seconds approx
per search.
That sounds slow, unless your queries are very complex.  What are your 
queries like?

What we observed was that lucene queues
the queries and does not release them until the
results are found. so the queries that have come in
later take up about 500 seconds. Please let us know
whether there is a technique to optimize lucene in
such circumstances. 
Multiple queries executed from different threads using a single searcher 
should not queue, but should run in parallel.  A technique to find out 
where threads are queueing is to get a thread dump and see where all of 
the threads are stuck.  On Solaris and Linux, sending the JVM a SIGQUIT 
will give a thread dump.  On Windows, use Control-Break.
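
On a Java 5 or later JVM (an assumption; this is newer than the setup discussed here), similar information can also be collected from inside the application with Thread.getAllStackTraces().  A minimal sketch, with ThreadDump being an invented class name:

```java
import java.util.Map;

class ThreadDump {
    // Builds a textual dump of every live thread: name, state, stack frames.
    static String dump() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> e
                : Thread.getAllStackTraces().entrySet()) {
            Thread t = e.getKey();
            sb.append('"').append(t.getName()).append("\" state=")
              .append(t.getState()).append('\n');
            for (StackTraceElement frame : e.getValue()) {
                sb.append("    at ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }
}
```

If many threads show state=BLOCKED with the same top frame, that frame is the likely point of contention.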

Doug


Re: Memory usage

2004-05-26 Thread Doug Cutting
James Dunn wrote:
Also I search across about 50 fields but I don't use
wildcard or range queries. 
Lucene uses one byte of RAM per document per searched field, to hold the 
normalization values.  So if you search a 10M document collection with 
50 fields, then you'll end up using 500MB of RAM.
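
The arithmetic can be written out as a small sketch, assuming exactly one byte per document per searched field as stated above.  NormMemory is an invented name, not a Lucene class.

```java
class NormMemory {
    // One byte of norm data per document per searched field.
    static long normBytes(long documents, int searchedFields) {
        return documents * searchedFields;
    }
}
```

For 10,000,000 documents and 50 searched fields this gives 500,000,000 bytes, i.e. the 500MB figure above.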

If you're using unanalyzed fields, then an easy workaround to reduce the 
number of fields is to combine many in a single field.  So, instead of, 
e.g., using an "f1" field with value "abc", and an "f2" field with value 
"efg", use a single field named "f" with values "1_abc" and "2_efg".
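
A minimal sketch of that value-prefixing scheme in plain Java, with no Lucene calls.  CombinedField and its token method are hypothetical names; the same token would have to be built at index time and at query time.

```java
class CombinedField {
    // Maps an original field suffix and an unanalyzed value to the token
    // stored in the single combined field, e.g. ("1", "abc") -> "1_abc".
    static String token(String fieldSuffix, String value) {
        return fieldSuffix + "_" + value;
    }
}
```

A search that used to look for "efg" in field "f2" would then look for CombinedField.token("2", "efg") in field "f".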

We could optimize this in Lucene.  If no values of an indexed field are 
analyzed, then we could store no norms for the field and hence read none 
into memory.  This wouldn't be too hard to implement...

Doug

