On 10/8/06, Chris Hostetter <[EMAIL PROTECTED]> wrote:
Thanks for tackling this Mike ... I've been dreading the whole issue of Lazy Loading but your patch gives me hope. I haven't had a chance to try it out, but reading through it, it seems a lot more straightforward than I'd feared. A couple of concerns jump out at me though, starting with the big one...

: Now, there is a concern with the doc cache of SolrIndexSearcher, which
: assumes it has the whole document in the cache. To maintain this
: invariant, it is still the case that all the fields in a document are
: loaded in a searcher.doc(i) call. However, if a field set is given to
: the method, only the given fields are loaded directly, while the rest
: are loaded lazily.

It doesn't look like this would actually be an invariant with your patch. Consider the case of two request handlers: X, which expects/uses all fields, and Y, which consistently uses only a few fields, so the rest are left to lazy loading.
Good point--I hadn't considered this case.
If Y hits doc N first, it puts N in the cache with only partial fields; isn't X then forced to use lazy loading every time after that?
Yes, but as you guess below...
Or does lazy loading not work that way? ... would the lazy fields still only be loaded once (the first time X asked for them) but then stored in the Document (which is still in the cache) ?
... lazy fields read their value once the first time it is requested, and operate as a Field thereafter.
... even if that is the case, would the performance hit of X loading each field lazily, one at a time, be more expensive than refetching completely?
I wouldn't expect there to be much of a difference. Lazy fields hold on to a stream and an offset, and operate by seek()'ing to the right position and loading the field as normal. Now, if the lazy fields were loaded in exactly the right order, the seeks would be no-ops. In practice this won't happen, but we would expect all seeks (after the first one) to land in the same disk block, which will be buffered, so the seeks amount to pointer arithmetic. But that isn't quite the right comparison, as lazy fields have to "skip" the field in the first place, which is cheap but not free (it is cheaper for binary and compressed fields than for string fields). The performance gain/loss will depend heavily on which request handler gets called more often and in what order. I don't think the cost would be too great, but I'm loath to guess.
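The load-once behaviour described above can be sketched as follows. This is a minimal illustration, not Lucene's actual implementation: the class and field names (FieldStream, LazyField, seeks) are hypothetical stand-ins. The lazy field remembers only a stream and an offset, seeks and reads on first access, and serves the cached value from then on, behaving like an ordinary Field.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for a stored-fields stream: a byte array plus a pointer. */
class FieldStream {
    final byte[] data;
    int pos;    // current position; seek() just moves this pointer
    int seeks;  // counts seeks, to show the value is only read once
    FieldStream(byte[] data) { this.data = data; }
    void seek(int offset) { pos = offset; seeks++; }
    String read(int len) {
        String s = new String(data, pos, len, StandardCharsets.UTF_8);
        pos += len;
        return s;
    }
}

/** A field whose value is fetched on the first stringValue() call, then cached. */
class LazyField {
    private final FieldStream stream;
    private final int offset, length;
    private String value;  // null until first access
    LazyField(FieldStream stream, int offset, int length) {
        this.stream = stream; this.offset = offset; this.length = length;
    }
    String stringValue() {
        if (value == null) {       // load once...
            stream.seek(offset);   // cheap if the disk block is already buffered
            value = stream.read(length);
        }
        return value;              // ...and behave like an ordinary Field afterwards
    }
}
```

Repeated accesses after the first never touch the stream again, which is why the extra cost should amortize out when cached documents are hit more than once.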
Depending on what would happen in the situation I described, perhaps the Solr document cache should only be used by the single-arg version of SolrIndexSearcher.doc? (If we do this, the two-arg version of readDocs needs to be changed to use it directly.)
Perhaps, but that has other problems (in the situation you describe above, every hit on the "few fields" handler would reload every document).
An alternate idea: the single-arg version could check whether an item found in the cache contains lazy fields and, if so, re-fetch and recache the full Document?
That could work, though I wonder if the O(num fields) cost per document access is worth it. Perhaps the document could be stored with a "lazy" flag in the cache, to make this check O(1).
FWIW: this isn't an obscure situation. The main index I deal with has a lot of metadata documents (one per category) which are *HUGE*, and there are two custom request handlers that use those documents -- one uses every field in them, and the other uses only a single field. That second handler would be an ideal candidate to start using your new SolrIndexSearcher.doc(int,Set<String>) method, but it would suck if doing that meant the other handler's performance suffered because it started taking longer to fetch all of the fields.
Perhaps instead of a "lazy" flag, the number of real fields could be stored in the cache. On the next document request, if more than 1-2 fields are requested beyond the "real" ones in the cache, the full document is re-fetched. How much of a problem this is also depends on how often documents are hit once in the cache. If it is more than a few times, the load-once behaviour of lazy fields should amortize out the extra cost.

Incidentally, I think one of the major benefits of lazy field loading isn't necessarily the retrieval cost, but the memory savings. In your example, presumably many documents are returned by the 1-field handler which the many-field handler never touches. These documents will occupy much less memory, and consequently the size of the document cache can be increased.
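That field-count heuristic could look something like the sketch below. All names here (CachedDoc, DocCache, SLACK) are hypothetical, not Solr's actual cache API: the cache entry records how many fields were loaded eagerly, and the lookup signals "refetch the full document" when the request wants noticeably more fields than the cached entry materialized.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical cache entry: a document plus the count of eagerly-loaded fields. */
class CachedDoc {
    final Map<String, String> fields;  // simplified stand-in for a Lucene Document
    final int realFieldCount;          // fields loaded eagerly (non-lazy)
    CachedDoc(Map<String, String> fields, int realFieldCount) {
        this.fields = fields; this.realFieldCount = realFieldCount;
    }
}

class DocCache {
    private static final int SLACK = 2;  // tolerate 1-2 lazy loads before refetching
    private final Map<Integer, CachedDoc> cache = new HashMap<>();

    void put(int docId, CachedDoc doc) { cache.put(docId, doc); }

    /** Returns the cached entry, or null to signal "fetch the full document"
     *  when the request needs many more fields than were eagerly loaded. */
    CachedDoc get(int docId, int requestedFieldCount) {
        CachedDoc cached = cache.get(docId);
        if (cached == null) return null;
        if (requestedFieldCount - cached.realFieldCount > SLACK) return null;
        return cached;
    }
}
```

The check is O(1) per lookup, so it keeps the cost concern from the O(num fields) scan mentioned above off the hot path.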
Other smaller issues... 1) It's not clear to me ... will optimizePreFetchDocs trigger an NPE if there are highlight fields but no uniqueKey for the schema? (The same potential exists in HighlightUtils, which has the same line.)
Good point. I don't think the test suite covers this case--and I've been bitten by it before.
2) why doesn't optimizePreFetchDocs use SolrIndexSearcher.readDocs (looks like cut/paste of method body)
It avoids the allocation of the Document[] array, and is three lines (vs. two lines to allocate the array and call readDocs).
3) Should we be concerned about letting people specify prefixes/suffixes of the fields they want to forcibly load for dynamicFields, instead of just a Set<String> of names? ... Or should we cross that bridge when we come to it? (I ask because we have no cache-aware method that takes a FieldSelector, just the one that takes the Set<String>.)
It would be very easy to add a parallel method that takes a FieldSelector. My only concern is that it might make it hard to do cache-flushing heuristics like the ones you suggested above.
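For reference, a Set<String> is just a special case of the FieldSelector idea. The sketch below uses minimal stand-ins (the enum and class names are hypothetical, modeled loosely on Lucene's interface rather than copied from it): named fields load eagerly, everything else is left to lazy loading. An arbitrary FieldSelector, by contrast, gives the cache no easy way to count how many "real" fields an entry holds, which is the difficulty with the heuristics mentioned above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical stand-ins for a FieldSelector-style per-field decision. */
enum FieldSelectorResult { LOAD, LAZY_LOAD }

interface FieldSelector {
    FieldSelectorResult accept(String fieldName);
}

/** Adapter: treat a Set<String> of field names as a FieldSelector. */
class SetBasedSelector implements FieldSelector {
    private final Set<String> eagerFields;
    SetBasedSelector(Set<String> eagerFields) { this.eagerFields = eagerFields; }
    public FieldSelectorResult accept(String fieldName) {
        // fields named in the set load eagerly; the rest are deferred
        return eagerFields.contains(fieldName)
                ? FieldSelectorResult.LOAD
                : FieldSelectorResult.LAZY_LOAD;
    }
}
```

A prefix/suffix-matching selector for dynamicFields would implement the same interface with a pattern test in accept() instead of a set lookup.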
And a few minor nits... * Lists in javadoc comments should use HTML <li> tags so they can be read cleanly in generated docs (see SolrPluginUtils.optimizePreFetchDocs). * There are some double imports of Fieldable in several classes (it looks like maybe they were already importing it once for no reason).
Thanks for the comments! -Mike
