Re: multiple collections indexing

2003-03-21 Thread Morus Walter
Hi,
 
 Are lots of different combinations of collections used frequently? 
 Probably not.  If only a handful of different subsets of collections are 
 frequently searched, then QueryFilter could be very useful.
 
I did some tests and thought the results might be interesting for others
as well.

I ran a number of queries for different combinations of collections,
from a single collection up to 33 collections in a query.

Collections were selected by:
- OR-combined queries on the collection id
- a QueryFilter
- a cached QueryFilter
- a MultiSearcher
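
For reference, the four variants correspond roughly to the sketch below
(written against the Lucene 1.3-era API; class and field names are mine,
not the actual test code). The cached-filter variant differs from the
plain filter only in keeping the QueryFilter instance around between
requests.

    import java.io.IOException;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.MultiSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.QueryFilter;
    import org.apache.lucene.search.Searchable;
    import org.apache.lucene.search.Searcher;
    import org.apache.lucene.search.TermQuery;

    public class SelectionVariants {

        // OR-combined selection clause on the collection id
        static BooleanQuery selection(String[] cols) {
            BooleanQuery sel = new BooleanQuery();
            for (int i = 0; i < cols.length; i++)
                sel.add(new TermQuery(new Term("collection", cols[i])),
                        false, false);                // optional clause
            return sel;
        }

        // variant 1: selection clauses ORed into the query itself
        static Hits orCombined(Searcher s, Query q, String[] cols)
                throws IOException {
            BooleanQuery combined = new BooleanQuery();
            combined.add(q, true, false);             // required
            combined.add(selection(cols), true, false);
            return s.search(combined);
        }

        // variant 2: a QueryFilter built from the selection per request
        static Hits filtered(Searcher s, Query q, String[] cols)
                throws IOException {
            return s.search(q, new QueryFilter(selection(cols)));
        }

        // variant 3: a MultiSearcher over one searcher per collection
        static Hits multi(Searchable[] perCollection, Query q)
                throws IOException {
            return new MultiSearcher(perCollection).search(q);
        }
    }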

In order to minimize effects like file system caching, I ran all queries
three times and counted the total search time for each method of
collection selection for all passes and for the second and third pass
only (assuming that caching effects will show up in the first pass only).

I also ran one test where the different methods were applied to the same
query one after another, and a second test where all queries for one
method were run as a block.
The time measured covers only query preparation and search; I did not
access any results (that should be independent of the query anyway).

Basically I found that the OR-combined queries are the worst: they take
nearly twice as much time as creating a filter for every request.
Caching filters is a good thing; it saves a lot of time.
Using a MultiSearcher is also much faster than creating a filter for
every request, so which one is better depends on the expected cache hit
rate.

For the first test (different methods one after another for the same
queries) I get
filter:           3810 (1591)   avg: 1270  (795)
filter (cached):   571   (77)   avg:  190   (38)
or combined:      6706 (2946)   avg: 2235 (1473)
multi:             872  (249)   avg:  290  (124)

(times in ms; the first number is the time for all three passes, the
second number in parentheses is the time excluding pass 1; the average
times are per pass; one pass contained 24 queries (4 different queries
and 6 different combinations of collections))

For the second test the times are slightly different but the
picture is the same:
filter:           3447 (1675)   avg: 1149  (837)
filter (cached):   590  (107)   avg:  196   (53)
or combined:      6611 (2746)   avg: 2203 (1373)
multi:             504  (228)   avg:  168  (114)

The test was done on a 700 MHz Intel P3 box running Linux (RedHat 8.0,
kernel 2.4.18) with 1 GB RAM.
The combined index is ~750 MB; the separate indexes range from 0.5
to 100 MB.

greetings
Morus




Re: multiple collections indexing

2003-03-20 Thread Morus Walter
Hi,

thanks for all your answers. I'll collect some of the hints and ideas here
rather than commenting on each of them separately.

Doug Cutting writes:
 Morus Walter wrote:
  Searches must be possible on any combination of collections.
  A typical search includes ~ 40 collections.
  
  Now the question is how best to implement this in Lucene.
  
  Currently I see basically three possibilities:
  - create a data field containing the collection name for each document
    and extend the query by an OR-combined list of queries on this name field.
 
 Are lots of different combinations of collections used frequently? 
 Probably not.  If only a handful of different subsets of collections are 
 frequently searched, then QueryFilter could be very useful.
 
Well, the data in question consists of German encyclopedias, dictionaries,
and glossaries. They are provided for a B2C website (www.xipolis.net) as
well as for various B2B deals. So on the one hand there are specific
combinations of collections; on the other hand, all users are free to
choose a subcollection they want to search. But this feature is used by
at most 10% of the queries.

 In this approach you construct a QueryFilter for each combination of 
 collections, passing it the collection name query.  Keep the query 
 filter around and re-use it whenever a query with that combination of 
 collections is made.  This is very fast.  It uses one bit per document 
 per filter.  So if you have a million documents and eight common 
 combinations of collections then this would use one megabyte.
 
 You could also keep a cache of QueryFilters in a LinkedHashMap (JDK 
 1.4).  If the size of the cache exceeds a limit, throw away its eldest 
 entry by overriding the removeEldestEntry method.  That way, if any 
 combination of collections is possible, but only a few are probable, you 
 can just cache the common subsets as QueryFilters.  Probably we should 
 provide such a QueryFilterCache class with Lucene...
 
Ok. I'll have a look at this.
I guess one could combine this with selections based on collection ids.


Thanks to John for his notes on multiple collections and
number of open files.
I guess it wouldn't be that bad, since our data is updated seldom
(a few collections get updates once a week or once a month; most of them
are updated less than once a year), so doing separate indexing and
optimization would be possible.
If I understand optimization and your calculation correctly, this means
that the f * log_f(N) factor goes away.

Thanks to Vladimir for his explanations on wildcard queries and
the clarification on the search term limitation.

Thanks to Ype for his comments on multiple collections.

greetings
Morus




Re: multiple collections indexing

2003-03-20 Thread Morus Walter
Tatu Saloranta writes:
 On Wednesday 19 March 2003 01:44, Morus Walter wrote:
 
 This might still be a feasible thing to do, except if the number of
 collections changes very frequently (as you would need to reindex all
 docs, not just do an incremental update).
 
Well, the number is slowly growing.

 Another possibility would be to have a new kind of Query, one to use with
 numeric field values (it would probably be easiest to use hex numbers). In
 a way it'd be a specialized/optimized version of WildcardQuery.
 
 For example, one could define a required bit pattern after masking the
 field value (in your case you'd use one bit per type, and require
 non-interesting type flags to be zeroes, knowing that then at least one
 other bit, matching an interesting type, is one).

Actually, that's what we are doing right now with our current fulltext
engine. It supports bitmasks in such a way that you can use a bitwise AND
operation between a 32-bit index field and a query term; this counts as a
match when the result is not zero. And a logical AND can be computed very
fast.
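
In Java terms, the match rule amounts to something like this (an
illustration, not the engine's actual code):

    // A document matches when its 32-bit collection field shares at
    // least one set bit with the query mask -- one AND and one compare,
    // which is why it can be computed so fast.
    static boolean matches(int docCollectionBits, int queryMask) {
        return (docCollectionBits & queryMask) != 0;
    }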

 Implementing this would be fairly easy: first find the range (as
 RangeQuery does), iterate over all existing terms in that range, match
 each against the bit pattern, and add the term if it matches.
 
 The actual search would then work pretty much like a prefix, wildcard or
 range query, as the Terms have been expanded at that point and the search
 part need not care how they were obtained.
 
 This would make the representation more compact (4 bits per char instead
 of one), potentially making the index a bit smaller (which usually also
 means faster). And of course, if you really want to push the limit, you
 could use an even more efficient encoding (although, assuming indexes use
 UTF-8, base64 might be almost as efficient as it gets, as ASCII chars take
 only one byte whereas upper chars take anywhere from 2 to 7 bytes [for
 Unicode 3; 4 for UCS-2]).
 
Ok. The advantage would be a shorter index, but there isn't an advantage
in how the query is executed.
So I could get a similar advantage from my approach 1 by simply using
two-character combinations for the collection name and creating a list of
OR-combined queries. If I use 2 characters from 0-9 and a-z, I get 1296
possible combinations, which should be sufficient for quite some time.
 
 Anyway, just an idea I thought might be worth sharing,
 
Well, that's what I was looking for :-)

Thanks a lot,
   Morus




multiple collections indexing

2003-03-19 Thread Morus Walter
Hi,

we are currently evaluating Lucene.

The data we'd like to index consists of ~ 80 collections of documents
(a few hundred up to 20,000 documents per collection, ~ 1.5 million
documents total; typical document size is on the order of 1 kB).

Searches must be possible on any combination of collections.
A typical search includes ~ 40 collections.

Now the question is how best to implement this in Lucene.

Currently I see basically three possibilities:
- create a data field containing the collection name for each document
  and extend the query by an OR-combined list of queries on this name field.
- create an index per collection and use a MultiSearcher to search all
  interesting indexes.
- (a third one I just discovered): create a data field containing a
  marker for each collection:
  x1000... for the first collection
  x0100... for the second
  x0010... for the third
  and so on.
  The query might use a wildcard search on this field using x?0?0...,
  specifying '?' for each collection that should be searched on, and '0'
  for the others.
  The marker would be very long though (the number of collections is
  growing, so we have to keep space for new ones as well).

So far we have set up the first approach (one index; size ~ 750 MB) and
this seems to work in principle and with reasonable performance.
I'm not too optimistic about the second approach. If I understand the docs
correctly, this would be a sequential search on each involved index,
combining the results.

So questions:
- has anyone experience with such a setup?
- are there other approaches to deal with it?
- is my expectation that multiple indexes are worse reasonable, or should
  we give it a try?
- how is wildcard search done? Could this be an improvement?

I understand that in the end we have to check this ourselves, but I'd
appreciate any hints and advice, since I couldn't find much on this
issue in the docs.

greetings
Morus




Re: multiple collections indexing

2003-03-19 Thread John L Cwikla

I'd actually be interested in hearing answers about this too, but from
our experience:

We do something similar: we have data indexed per account id (100 or so).
We keep them separate so that if one of them blows up, the rest stay
available; downtime is not acceptable, so we have the data partitioned.
Unlike you, however, we don't have the requirement of searching arbitrary
groups of accounts (we may someday create a grand index over all accounts).

That being said we have found:

Using a field for your collection name may be a problem, since Lucene has
a limit of 32 (or 64?) ORs/ANDs per query, I believe. With 80+ ORs
possible for your collections, plus the real query itself, you will run
into trouble. I would assume this limit is there for performance reasons
and is arbitrary; it could be made dynamic, or at least set to some much
larger limit.

Having each in its own index is another option. I don't know the specifics
of MultiSearcher performance, but keep in mind that Lucene keeps all index
files open, so the maximum number of open files is f * log_f(N) * S * C,

where f is IndexWriter.mergeFactor (default 10),
N is the number of documents (you said max 20,000),
S is 7 + the number of fields, and
C is the number of your collections.

So that would be, for 1 field:

10 * 5 * 1 * 80 = 4000 file handles open

For 5 fields:

10 * 5 * 5 * 80 = 20,000 file handles open

For 10 fields:

10 * 5 * 10 * 80 = 40,000 file handles open
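
The formula as written translates into a small helper for playing with
the numbers (the method and parameter names are illustrative):

    // Estimated open files for non-optimized indexes: f * log_f(N) * S * C.
    static long estimateOpenFiles(int mergeFactor, long docsPerIndex,
                                  int filesPerSegment, int collections) {
        double logFN = Math.log(docsPerIndex) / Math.log(mergeFactor);
        return Math.round(mergeFactor * logFN * filesPerSegment * collections);
    }
    // estimateOpenFiles(10, 100000, 5, 80) == 20000, the middle case above.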

Get the picture :)  Assuming you have these all in one process or on one
machine, you have to be a bit careful to make sure things don't blow up
(we have a Linux machine that we've modified to allow 15,000 open file
handles to handle this). Granted, you can optimize, which may give you
some relief, but that takes time, so it's not something we get to do all
the time (some of our indexes have millions of documents). Or, obviously,
you can split these amongst servers or processes. (Someone had posted a
while ago about making Lucene transactional with a log such that the
number of open files was always fixed, but I don't know what happened to
that.)

I'll have to ponder your last option more tomorrow, so I can sleep now.

This is actually one of the things I'd like to see addressed -- how Lucene
can handle partitioned data in a more scalable manner.





Re: multiple collections indexing

2003-03-19 Thread Vladimir Lukin
Hello Morus,

I'll explain how a wildcard query works:

1. First, it runs over the lexicon and collects a list of terms that
   satisfy the specified pattern.
2. Then it makes a boolean query joining the collected terms with OR.
3. Then the constructed boolean query is used for searching.

So it seems to me that using a wildcard search doesn't give any
performance benefit in comparison with extending the query by an OR-joined
list of collection names, because both have almost the same complexity.

Moreover, using a field for the collection name will cause no problem:
although a boolean query can't contain more than 32 required and
prohibited clauses (in the release I use), it has no limit on optional
clauses, so we can OR-join up to Integer.MAX_VALUE queries.
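
A sketch of that with the BooleanQuery API of this era (the field name is
illustrative):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class OrJoin {
        // OR-join the collection names as optional clauses; optional
        // clauses are not subject to the 32-clause limit on required
        // and prohibited clauses mentioned above.
        public static Query collectionClause(String[] names) {
            BooleanQuery q = new BooleanQuery();
            for (int i = 0; i < names.length; i++)
                q.add(new TermQuery(new Term("collection", names[i])),
                      false, false);   // required=false, prohibited=false
            return q;
        }
    }

The resulting clause would then be added to the main query as a required
clause.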

Using filters can afford some performance benefit, if you can somehow
create a filter faster than by running a query on collection names, and
then use this filter with the main query. This could be done, for example,
by loading the document id list for each collection into memory and then
merging them. This will save some time, but it is inefficient in terms of
memory use, and you'd also have to write some code.
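
A minimal sketch of that idea as a custom Filter (against the Lucene 1.x
Filter API; class and field names are mine):

    import java.io.IOException;
    import java.util.BitSet;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;
    import org.apache.lucene.search.Filter;

    public class CollectionFilter extends Filter {
        private final String[] collections;

        public CollectionFilter(String[] collections) {
            this.collections = collections;
        }

        // Merge the document id lists of the selected collections into
        // one BitSet; this is where the memory cost comes from.
        public BitSet bits(IndexReader reader) throws IOException {
            BitSet bits = new BitSet(reader.maxDoc());
            for (int i = 0; i < collections.length; i++) {
                TermDocs docs = reader.termDocs(
                        new Term("collection", collections[i]));
                try {
                    while (docs.next())
                        bits.set(docs.doc());
                } finally {
                    docs.close();
                }
            }
            return bits;
        }
    }

It would be used as searcher.search(query, new CollectionFilter(names)).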


-- 
Best regards,
 Vladimir    mailto:[EMAIL PROTECTED]





Re: multiple collections indexing

2003-03-19 Thread Tatu Saloranta
On Wednesday 19 March 2003 01:44, Morus Walter wrote:
...
 Searches must be possible on any combination of collections.
 A typical search includes ~ 40 collections.

 Now the question is how best to implement this in Lucene.

 Currently I see basically three possibilities:
 - create a data field containing the collection name for each document
   and extend the query by an OR-combined list of queries on this name field.
 - create an index per collection and use a MultiSearcher to search all
   interesting indexes.
 - (a third one I just discovered): create a data field containing a
   marker for each collection:
   x1000... for the first collection
   x0100... for the second
   x0010... for the third
   and so on.
   The query might use a wildcard search on this field using x?0?0...,
   specifying '?' for each collection that should be searched on, and '0'
   for the others.
   The marker would be very long though (the number of collections is
   growing, so we have to keep space for new ones as well).

This might still be a feasible thing to do, except if the number of
collections changes very frequently (as you would need to reindex all
docs, not just do an incremental update).

Another possibility would be to have a new kind of Query, one to use with
numeric field values (it would probably be easiest to use hex numbers). In
a way it'd be a specialized/optimized version of WildcardQuery.

For example, one could define a required bit pattern after masking the
field value (in your case you'd use one bit per type, and require
non-interesting type flags to be zeroes, knowing that then at least one
other bit, matching an interesting type, is one).
Implementing this would be fairly easy: first find the range (as
RangeQuery does), iterate over all existing terms in that range, match
each against the bit pattern, and add the term if it matches.

The actual search would then work pretty much like a prefix, wildcard or
range query, as the Terms have been expanded at that point and the search
part need not care how they were obtained.
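
A rough sketch of the expansion step (assuming hex-encoded terms and the
Lucene 1.x TermEnum idiom; all names are mine):

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermEnum;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class BitmaskQueryBuilder {
        // Enumerate the terms of the given field, decode each term's hex
        // value, and OR together a TermQuery for every term that shares
        // at least one bit with the requested mask.
        public static Query build(IndexReader reader, String field, long mask)
                throws IOException {
            BooleanQuery query = new BooleanQuery();
            TermEnum terms = reader.terms(new Term(field, ""));
            try {
                do {
                    Term t = terms.term();
                    if (t == null || !t.field().equals(field))
                        break;                        // past this field's terms
                    if ((Long.parseLong(t.text(), 16) & mask) != 0)
                        query.add(new TermQuery(t), false, false); // optional
                } while (terms.next());
            } finally {
                terms.close();
            }
            return query;
        }
    }

The resulting query would then be combined with the main query like any
other clause.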

This would make the representation more compact (4 bits per char instead
of one), potentially making the index a bit smaller (which usually also
means faster). And of course, if you really want to push the limit, you
could use an even more efficient encoding (although, assuming indexes use
UTF-8, base64 might be almost as efficient as it gets, as ASCII chars take
only one byte whereas upper chars take anywhere from 2 to 7 bytes [for
Unicode 3; 4 for UCS-2]).

Adding such a query would need to be done outside QueryParser (as the
length of the bitfield would be variable), but in your case that probably
shouldn't be a problem?

Anyway, just an idea I thought might be worth sharing,

-+ Tatu +-





Re: multiple collections indexing

2003-03-19 Thread Ype Kingma
Morus,

On Wednesday 19 March 2003 00:44, Morus Walter wrote:
 Hi,

 we are currently evaluating Lucene.

 The data we'd like to index consists of ~ 80 collections of documents
 (a few hundred up to 20,000 documents per collection, ~ 1.5 million
 documents total; typical document size is on the order of 1 kB).

 Searches must be possible on any combination of collections.
 A typical search includes ~ 40 collections.

 Now the question is how best to implement this in Lucene.

 Currently I see basically three possibilities:
 - create a data field containing the collection name for each document
   and extend the query by an OR-combined list of queries on this name field.
 - create an index per collection and use a MultiSearcher to search all
   interesting indexes.
 - (a third one I just discovered): create a data field containing a
   marker for each collection:
   x1000... for the first collection
   x0100... for the second
   x0010... for the third
   and so on.
   The query might use a wildcard search on this field using x?0?0...,
   specifying '?' for each collection that should be searched on, and '0'
   for the others.
   The marker would be very long though (the number of collections is
   growing, so we have to keep space for new ones as well).

 So far we have set up the first approach (one index; size ~ 750 MB) and
 this seems to work in principle and with reasonable performance.
 I'm not too optimistic about the second approach. If I understand the
 docs correctly, this would be a sequential search on each involved index,
 combining the results.

 So questions:
 - has anyone experience with such a setup?

I'm dealing with 26 collections searchable in any combination. Total index
size is currently about 250 MB, with very little stored text, i.e. with
stored text the index would be about 1.2 GB. I currently have no
performance problems.

An advantage of using multiple collections is that optimizing takes less
time. A disadvantage is that you have to deal with quite a few files for
all the segments (esp. non-optimized segments), which can be problematic
under Windows (max. 2000 open files total per machine, iirc).

Whether query times would change when merging the collections I can't
say. It depends on the performance of the query method used to search
a fully merged index. It also depends on the distribution of terms over
your collections: when a lot of your queries are on a small subset of the
collections, I'd keep them unmerged.
It also depends on the number of disks you are using: in case you
have multiple disks, it's probably worthwhile to put at least one
collection on each disk and use a parallelized version of the
MultiSearcher.

I would advise preparing for a merge anyway by using a field identifying
the collection of each document (i.e. your 3rd method). Using this 3rd
option you can then either add a boolean query to all your queries to
select the combination of collections, or use a filter for the combination
of collections. You can use a (user-visible) name for a collection instead
of the bit mask; Lucene nicely compresses this information in its indexes
anyway.

 - are there other approaches to deal with it?

Not that I know of.

 - is my expectation that multiple indexes are worse reasonable, or should
   we give it a try?

Depends on your requirements: in case you need to update one collection
while searching others, it is preferable to have a Lucene index per
collection. Anyway, it's mostly a technical decision.
I have no problems with my 26 collections (on a single disk).

The only user-visible aspect of this is the way ranking is done:
it uses the inverse document frequency of each query term.
When you use different combinations of collections in a MultiSearcher
(Lucene 1.3), the document frequency may be different for each
combination, because it computes the total document frequency for each
term over the combination of collections queried.

I'm currently using Lucene 1.2 with my own multi searcher, which
just sends the query off to each database and uses the
document frequency of each term in each collection separately
for its ranking. Even this works fine, although merging query results is
less valid in this case.

When you use only a single index, the document frequency
is always the same for each query term, and you may have to make
sure that the terms you use to identify the collections get a
very low weight.

 - how is wildcard search done? Could this be an improvement?

I seem to miss the point.

 I understand that in the end we have to check this ourselves, but I'd
 appreciate any hints and advice since I couldn't find much on this
 issue in the docs.

My pleasure,
Ype





Re: multiple collections indexing

2003-03-19 Thread Doug Cutting
Morus Walter wrote:
Searches must be possible on any combination of collections.
A typical search includes ~ 40 collections.

Now the question is how best to implement this in Lucene.

Currently I see basically three possibilities:
- create a data field containing the collection name for each document
  and extend the query by an OR-combined list of queries on this name field.
Are lots of different combinations of collections used frequently? 
Probably not.  If only a handful of different subsets of collections are 
frequently searched, then QueryFilter could be very useful.

In this approach you construct a QueryFilter for each combination of 
collections, passing it the collection name query.  Keep the query 
filter around and re-use it whenever a query with that combination of 
collections is made.  This is very fast.  It uses one bit per document 
per filter.  So if you have a million documents and eight common 
combinations of collections then this would use one megabyte.

You could also keep a cache of QueryFilters in a LinkedHashMap (JDK 
1.4).  If the size of the cache exceeds a limit, throw away its eldest 
entry by overriding the removeEldestEntry method.  That way, if any 
combination of collections is possible, but only a few are probable, you 
can just cache the common subsets as QueryFilters.  Probably we should 
provide such a QueryFilterCache class with Lucene...
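
A minimal sketch of such a cache, assuming JDK 1.4's LinkedHashMap and the
QueryFilter class described above (QueryFilterCache is the suggested name,
not an existing Lucene class; the details are a guess):

    import java.util.LinkedHashMap;
    import java.util.Map;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.QueryFilter;

    public class QueryFilterCache {
        private final int maxSize;
        private final Map cache;

        public QueryFilterCache(int maxSize) {
            this.maxSize = maxSize;
            // accessOrder=true gives least-recently-used eviction
            this.cache = new LinkedHashMap(16, 0.75f, true) {
                protected boolean removeEldestEntry(Map.Entry eldest) {
                    return size() > QueryFilterCache.this.maxSize;
                }
            };
        }

        // key: a canonical representation of the collection combination;
        // re-using the returned filter is what makes repeat searches fast.
        public synchronized QueryFilter get(String key, Query collectionQuery) {
            QueryFilter filter = (QueryFilter) cache.get(key);
            if (filter == null) {
                filter = new QueryFilter(collectionQuery);
                cache.put(key, filter);
            }
            return filter;
        }
    }

A searcher would call cache.get(key, collectionQuery) and pass the result
to Searcher.search(query, filter).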

This is the approach that I would use.

Doug
