Re: multiple collections indexing
Hi,

  Are lots of different combinations of collections used frequently? Probably not. If only a handful of different subsets of collections are frequently searched, then QueryFilter could be very useful.

I did some tests and thought the results might be interesting for others as well. I ran a number of queries for different combinations of collections, from single collections up to 33 collections in a query. Collections were selected by
- OR-combined queries on the collection id
- a QueryFilter
- a cached QueryFilter
- a MultiSearcher

In order to minimize effects like file system caching, I ran all queries three times and counted the total search time for each method of collection selection, both over all passes and over the second and third pass only (assuming that caching effects show up in the first pass only). Further, I did one test where the different methods were used on the same query one after another, and another test where I ran all queries for one method as a block. The measured time covers only query preparation and search; I did not access any results, which should be independent of the selection method anyway.

Basically I found that using the OR-combined queries is worst: it takes nearly twice as much time as creating a filter for every request. Caching of filters is a good thing; it saves a lot of time. Using a MultiSearcher is also much faster than creating a filter for every request. So which of the two is better depends on the expected cache hit rate.
For the first test (different methods one after another for the same queries) I get:

  filter:          3810 (1591)   avg: 1270  (795)
  filter (cached):  571   (77)   avg:  190   (38)
  or combined:     6706 (2946)   avg: 2235 (1473)
  multi:            872  (249)   avg:  290  (124)

(Times in ms; the first number is the time for all three passes, the number in parentheses is the time excluding pass 1; the average times are per pass. One pass contained 24 queries: 4 different queries times 6 different combinations of collections.)

For the second test the times are slightly different, but the picture is the same:

  filter:          3447 (1675)   avg: 1149  (837)
  filter (cached):  590  (107)   avg:  196   (53)
  or combined:     6611 (2746)   avg: 2203 (1373)
  multi:            504  (228)   avg:  168  (114)

The test was done on a 700 MHz Intel P3 box running Linux (RedHat 8.0, kernel 2.4.18) with 1 GB RAM. The combined index is ~750 MB; the separate indexes are between 0.5 and 100 MB each.

greetings
Morus

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: multiple collections indexing
Hi,

thanks for all your answers. I'll collect some of the hints and ideas rather than commenting on each of them separately.

Doug Cutting writes:

  Morus Walter wrote: Searches must be possible on any combination of collections. A typical search includes ~ 40 collections. Now the question is how best to implement this in Lucene. Currently I see basically three possibilities: - create a data field containing the collection name for each document and extend the query by an OR-combined list of queries on this name field.

  Are lots of different combinations of collections used frequently? Probably not. If only a handful of different subsets of collections are frequently searched, then QueryFilter could be very useful.

Well, the data in question consists of German encyclopedias, dictionaries and glossaries. They are provided for a B2C website (www.xipolis.net) as well as for various B2B deals. So on the one hand there are specific combinations of collections; on the other hand all users are free to choose the subcollection they want to search. But this feature is used by at most 10% of the queries.

  In this approach you construct a QueryFilter for each combination of collections, passing it the collection name query. Keep the query filter around and re-use it whenever a query with that combination of collections is made. [...] You could also keep a cache of QueryFilters in a LinkedHashMap (JDK 1.4), throwing away the eldest entry by overriding the removeEldestEntry method. [...]

Ok. I'll have a look at this. I guess one could combine this with selections based on collection ids.
Thanks to John for his notes on multiple collections and the number of open files. I guess it wouldn't be that bad, since our data is updated seldom (a few collections get updates once a week or once a month; most of them are updated less than once a year). So doing separate indexing and optimization would be possible. If I understand optimization and your calculation correctly, this means that the f * log_f(N) factor goes away.

Thanks to Vladimir for his explanations on wildcard queries and the clarification on the search term limitation.

Thanks to Ype for his comments on multiple collections.

greetings
Morus
Re: multiple collections indexing
Tatu Saloranta writes:

  On Wednesday 19 March 2003 01:44, Morus Walter wrote: [...] This might still be a feasible thing to do, except if the number of collections changes very frequently (as you need to reindex all docs, not just incrementally).

Well, the number is slowly growing.

  Another possibility would be to have a new kind of Query, one to use with numeric field values (probably easiest to use hex numbers). In a way it'd be a specialized/optimized version of WildcardQuery. For example, one could define a required bit pattern after ANDing the field value with a mask (in your case you'd use one bit per type, and require non-interesting type flags to be zeroes, knowing that then at least one other bit, matching an interesting type, is one).

Actually, that's what we are doing right now with our current fulltext engine. It supports bitmasks in such a way that you can apply a bitwise AND between a 32-bit index field and a query term; this counts as a match when the result is not zero. And a logical AND can be computed very fast.

  Implementing this would be fairly easy: first find the range (like RangeQuery does), iterate over all existing terms in that range, match each against the bit pattern, and add the term if it matches. The actual search would then work pretty much like a prefix, wildcard or range query, as the Terms have been expanded at that point and the search part need not care how they were obtained. This would make the representation more compact (4 bits in a char instead of one), potentially making the index a bit smaller (which usually also means faster). [...]

Ok. The advantage would be a shorter index, but there isn't an advantage in how the query is executed.
So I could get a similar advantage with my approach 1 by simply using two-character combinations for the collection name and creating a list of OR-combined queries. If I use 2 characters from 0-9 and a-z, I get 36^2 = 1296 possible combinations, which should be sufficient for quite some time.

  Anyway, just an idea I thought might be worth sharing,

Well, that's what I was looking for :-)

Thanks a lot,
Morus
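As an editorial aside, the two-character collection codes discussed above amount to a fixed-width base-36 encoding of the collection id, giving 36^2 = 1296 distinct codes. A minimal sketch (the class and method names are my own, not from the thread):

```java
class CollectionCode {
    // Encode a collection id (0..1295) as a fixed-width two-character
    // base-36 code over the digits 0-9 and a-z.
    static String code(int id) {
        if (id < 0 || id >= 36 * 36)
            throw new IllegalArgumentException("id out of range: " + id);
        String s = Integer.toString(id, 36);
        // left-pad to width 2 so all codes sort and compare uniformly
        return s.length() < 2 ? "0" + s : s;
    }

    public static void main(String[] args) {
        System.out.println(code(0));    // "00"
        System.out.println(code(35));   // "0z"
        System.out.println(code(1295)); // "zz"
    }
}
```

Each document then stores its collection's code, and a search over N collections is an OR over N two-character terms.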
multiple collections indexing
Hi,

we are currently evaluating Lucene. The data we'd like to index consists of ~ 80 collections of documents (a few hundred up to 20,000 documents per collection, ~ 1.5 million documents total; the average document size is on the order of 1 kB). Searches must be possible on any combination of collections. A typical search includes ~ 40 collections.

Now the question is how best to implement this in Lucene. Currently I see basically three possibilities:
- create a data field containing the collection name for each document and extend the query by an OR-combined list of queries on this name field.
- create an index per collection and use a MultiSearcher to search all interesting indexes.
- (a third one I just discovered): create a data field containing a marker for each collection: x10... for the first collection, x01... for the second, x001000... for the third, and so on. The query might use a wildcard search on this field using x?0?0..., specifying '?' for each collection that should be searched and '0' for the others. The marker would be very long though (the number of collections is growing, so we have to keep space for new ones as well).

So far we set up the first approach (one index; size ~ 750 MB) and this seems to work in principle and with reasonable performance. I'm not too optimistic about the second approach. If I understand the docs correctly, this would be a sequential search on each involved index, combining the results.

So, questions:
- has anyone experience with such a setup?
- are there other approaches to deal with it?
- is my expectation that multiple indexes are worse reasonable, or should we give it a try?
- how is wildcard search done? Could this be an improvement?

I understand that in the end we have to check this ourselves, but I'd appreciate any hints and advice, since I couldn't find much on this issue in the docs.

greetings
Morus
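As an editorial aside, the marker field of the third option and its wildcard pattern can be sketched as follows; the helper names are mine, and real code would pre-allocate the marker width to leave room for new collections:

```java
class CollectionMarker {
    // Build the marker stored with a document that belongs to collection
    // `index` out of `total` collections: 'x' followed by one 0/1 flag
    // per collection, e.g. index 1 of 4 -> "x0100".
    static String marker(int index, int total) {
        StringBuilder sb = new StringBuilder("x");
        for (int i = 0; i < total; i++)
            sb.append(i == index ? '1' : '0');
        return sb.toString();
    }

    // Build the wildcard pattern for a search: '?' for each collection
    // that should be searched, '0' for the others, e.g. "x?0?0".
    static String pattern(boolean[] searched) {
        StringBuilder sb = new StringBuilder("x");
        for (boolean s : searched)
            sb.append(s ? '?' : '0');
        return sb.toString();
    }
}
```

A document in the second of four collections would carry marker "x0100", and a search over collections 1 and 3 would use the pattern "x?0?0".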
Re: multiple collections indexing
I'd actually be interested in hearing answers about this too, but from our experience: We do something similar. We have data that we have indexed per account id (100 or so). We keep them separate so that if one of them blows up the others stay available; downtime is not acceptable, so we have the data partitioned. Unlike you, however, we don't have the requirement of searching arbitrary groups of accounts (we may someday create a grand index over all accounts). That being said, we have found:

Using a field for your collection name will be a problem, since Lucene has a limit of 32 (or 64?) AND/OR clauses per query, I believe. With 80+ ORs possible for your collections, plus the real query itself, you will run into trouble. I would assume that this limit is for performance reasons and is arbitrary, and could be made dynamic, or at least set to some much larger limit.

Having each in its own index is another option. I don't know the specifics of MultiSearcher performance, but keep in mind that Lucene keeps all index files open, so the maximum number of files that can be open is

  f * log_f(N) * S * C

where f is IndexWriter.mergeFactor (default 10), N is the number of documents (you said max ~20,000), S is 7 + the number of fields, and C is the number of your collections.

So that would be, for 1 field: 10 * 5 * 1 * 80 = 4,000 file handles open. For 5 fields: 10 * 5 * 5 * 80 = 20,000 file handles open. For 10 fields: 10 * 5 * 10 * 80 = 40,000 file handles open. Get the picture? :)

Assuming you have these all in one process or machine, you have to be a bit careful to make sure things don't blow up (we have a Linux machine that we've modified to allow 15,000 open file handles to handle this). Granted, you can optimize, but that takes time, so it's not something we get to do all the time (some of our indexes have millions of documents); that may give you some relief. Or, obviously, you can split these amongst servers or processes.
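As an editorial aside, the open-file estimate above can be written down as a small calculation. Note that the poster's own arithmetic plugs the bare field count in for S rather than 7 + fields; this sketch reproduces that arithmetic, so treat it as illustrative rather than exact:

```java
class FileHandleEstimate {
    // Open-file estimate from the post above: f * log_f(N) * S * C.
    // NOTE: the poster's arithmetic uses the bare field count for S
    // (not 7 + fields); this sketch reproduces that, so the result
    // is a rough lower bound, not an exact figure.
    static long estimate(int mergeFactor, long docsPerCollection,
                         int numFields, int numCollections) {
        // number of segment "levels": log base mergeFactor of the doc count
        long levels = (long) Math.ceil(
                Math.log(docsPerCollection) / Math.log(mergeFactor));
        return mergeFactor * levels * numFields * (long) numCollections;
    }

    public static void main(String[] args) {
        System.out.println(estimate(10, 20000, 1, 80));  // 4000
        System.out.println(estimate(10, 20000, 5, 80));  // 20000
        System.out.println(estimate(10, 20000, 10, 80)); // 40000
    }
}
```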
(Someone had posted a while ago about making Lucene transactional with a log, such that the number of open files was always fixed, but I don't know what happened to that.)

The last option you have I'll have to ponder more tomorrow, so I can sleep now. This is actually one of the things I'd like to see addressed -- how Lucene can handle partitioned data in a more scalable manner.

- Original Message -
From: Morus Walter [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Wednesday, March 19, 2003 12:44 AM
Subject: multiple collections indexing

[original message snipped]
Re: multiple collections indexing
Hello Morus,

let me explain how wildcard query works:
1. First, it runs over the lexicon and collects a list of terms that satisfy the specified pattern.
2. Then it makes a boolean query joining the collected terms with OR.
3. Then the constructed boolean query is used for searching.

So it seems to me that using a wildcard search doesn't give any performance benefit in comparison with extending the query by an OR-joined list of collection names, because both have almost the same complexity. Moreover, using a field for the collection name will cause no problem: though a boolean query can't contain more than 32 (in the release I use) required and prohibited clauses, it has no limit on optional clauses, so we can join with OR up to Integer.MAX_VALUE queries.

Using filters can offer some performance benefit, if you can somehow create a filter faster than by running a query on collection names and then use this filter with the main query. This could be approached, for example, by loading the document id list for each collection into memory and then merging them. This will give some benefit in time, but it is inefficient in memory use, and you'd also have to write some code.

  [original message snipped]

--
Best regards,
Vladimir    mailto:[EMAIL PROTECTED]
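As an editorial aside, the doc-id-merging filter Vladimir describes could look roughly like this; a plain BitSet sketch, independent of any Lucene API, with names of my own choosing:

```java
import java.util.BitSet;

class CollectionFilter {
    // OR the document id lists of the selected collections into one
    // bit set; bit i set means document i passes the filter.
    static BitSet merge(int maxDoc, int[][] docIdsPerCollection) {
        BitSet bits = new BitSet(maxDoc);
        for (int[] ids : docIdsPerCollection)
            for (int id : ids)
                bits.set(id);
        return bits;
    }
}
```

The memory cost Vladimir mentions is the per-collection id lists held in RAM; the resulting bit set itself costs one bit per document, like a QueryFilter.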
Re: multiple collections indexing
On Wednesday 19 March 2003 01:44, Morus Walter wrote:

  [original question snipped]

This might still be a feasible thing to do, except if the number of collections changes very frequently (as you need to reindex all docs, not just incrementally).

Another possibility would be to have a new kind of Query, one to use with numeric field values (probably easiest to use hex numbers). In a way it'd be a specialized/optimized version of WildcardQuery. For example, one could define a required bit pattern after ANDing the field value with a mask (in your case you'd use one bit per type, and require non-interesting type flags to be zeroes, knowing that then at least one other bit, matching an interesting type, is one).

Implementing this would be fairly easy: first find the range (like RangeQuery does), iterate over all existing terms in that range, match each against the bit pattern, and add the term if it matches the pattern. The actual search would then work pretty much like a prefix, wildcard or range query, as the Terms have been expanded at that point and the search part need not care how they were obtained.
This would make the representation more compact (4 bits in a char instead of one), potentially making the index a bit smaller (which usually also means faster). And of course, if you really want to push the limit, you could use an even more efficient encoding (although, assuming indexes use UTF-8, base64 might be almost as efficient as it gets, as ASCII chars only take one byte whereas upper chars take anywhere from 2 to 7 [for unicode-3? 4 for UC2] bytes).

Adding such a query would need to be done outside QueryParser (as the length of the bitfield would be variable), but in your case that probably shouldn't be a problem?

Anyway, just an idea I thought might be worth sharing,

-+ Tatu +-
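As an editorial aside, the mask test at the core of Tatu's proposed query (and of the engine Morus describes elsewhere in the thread) is just a bitwise AND on the decoded term. A hex-encoded sketch, with names of my own choosing:

```java
class BitmaskMatch {
    // Decode a hex-encoded collection bitmask stored as an index term
    // and test it against the query mask: any shared bit is a match.
    static boolean matches(String hexTerm, long queryMask) {
        long termBits = Long.parseLong(hexTerm, 16);
        return (termBits & queryMask) != 0;
    }
}
```

In the proposed query this test would run once per candidate term during term expansion, after which searching proceeds like any other expanded (prefix/wildcard/range) query.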
Re: multiple collections indexing
Morus,

On Wednesday 19 March 2003 00:44, Morus Walter wrote:

  [original question snipped]

  - has anyone experience with such a setup?

I'm dealing with 26 collections searchable in any combination. Total index size is currently about 250 MB with very little stored text; with stored text the index would be about 1.2 GB. I currently have no performance problems.

An advantage of using multiple collections is that optimizing takes less time. A disadvantage is that you have to deal with quite a few files for all the segments (esp.
non-optimized segments), which can be problematic under Windows (max. total 2000 open files per machine, iirc). Whether query times would change when merging the collections I can't say. It depends on the performance of the query method used to search a fully merged index. It also depends on the distribution of terms over your collections: when a lot of your queries are on a small subset of the collections, I'd keep them unmerged. It also depends on the number of disks you are using: in case you have multiple disks, it's probably worthwhile to put at least one collection on each disk and use a parallelized version of the MultiSearcher.

I would advise preparing for a merge anyway by using a field identifying the collection of each document (i.e. your third method). Using this third option you can then either add a boolean query to all your queries to select the combination of collections, or use a filter for the combination of collections. You can use a (user) name for a collection instead of the bit mask; Lucene nicely compresses this information in its indexes anyway.

  - are there other approaches to deal with it?

Not that I know of.

  - is my expectation that multiple indexes are worse reasonable, or should we give it a try?

That depends on your requirements: in case you need to update one collection while searching others, it is preferable to have a Lucene index per collection. Anyway, it's mostly a technical decision. I have no problems with my 26 collections (on a single disk).

The only user-visible aspect of this is the way ranking is done: it uses the inverse document frequency of each query term. When you use different combinations of collections in a MultiSearcher (Lucene 1.3), the document frequency may differ for each combination, because it computes the total document frequency for each term over the combination of collections queried.
I'm currently using Lucene 1.2 with my own multi searcher, which just sends the query off to each database and uses the document frequency of each term in each collection separately for its ranking. Even this works fine, although merging query results is less valid in this case. When you only use a single index, the document frequency is always the same for each query term, and you may have to make sure that the terms you use to identify the collections get a very low weight.

  - how is wildcard search done? Could this be an improvement?

I seem to miss the point.

  - I understand that in the end we have to check this ourselves, but I'd appreciate any hints and advice, since I couldn't find much on this issue in the docs.

My pleasure,

Ype
Re: multiple collections indexing
Morus Walter wrote:

  Searches must be possible on any combination of collections. A typical search includes ~ 40 collections. Now the question is how best to implement this in Lucene. Currently I see basically three possibilities: - create a data field containing the collection name for each document and extend the query by an OR-combined list of queries on this name field. [...]

Are lots of different combinations of collections used frequently? Probably not. If only a handful of different subsets of collections are frequently searched, then QueryFilter could be very useful.

In this approach you construct a QueryFilter for each combination of collections, passing it the collection name query. Keep the query filter around and re-use it whenever a query with that combination of collections is made. This is very fast. It uses one bit per document per filter. So if you have a million documents and eight common combinations of collections, then this would use one megabyte.

You could also keep a cache of QueryFilters in a LinkedHashMap (JDK 1.4). If the size of the cache exceeds a limit, throw away its eldest entry by overriding the removeEldestEntry method. That way, if any combination of collections is possible, but only a few are probable, you can just cache the common subsets as QueryFilters. Probably we should provide such a QueryFilterCache class with Lucene...

This is the approach that I would use.

Doug
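As an editorial aside, Doug's LinkedHashMap suggestion amounts to a small LRU cache. A generic sketch (the class name is mine; in a real setup the value type would be Lucene's QueryFilter, keyed by the set of collection names):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A bounded LRU cache built on LinkedHashMap, as suggested above:
// the eldest (least recently used) entry is evicted once the cache
// grows past maxEntries.
class FilterCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    FilterCache(int maxEntries) {
        super(16, 0.75f, true); // access-order, so eldest == LRU
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }
}
```

Passing true for access-order in the constructor is what makes "eldest" mean "least recently used" rather than "first inserted", so frequently reused filters stay cached.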